Abstract

We study the problem of deinterleaving a set of finite-memory (Markov) processes over disjoint 
finite alphabets, which have been randomly interleaved by a finite-memory switch. The deinterleaver has 
access to a sample of the resulting interleaved process, but no knowledge of the number or structure of 
the component Markov processes, or of the switch. We study conditions for uniqueness of the interleaved 
representation of a process, showing that certain switch configurations, as well as memoryless component 
processes, can cause ambiguities in the representation. We show that a deinterleaving scheme based 
on minimizing a penalized maximum-likelihood cost function is strongly consistent, in the sense of 
reconstructing, almost surely as the observed sequence length tends to infinity, a set of component 
and switch Markov processes compatible with the original interleaved process. Furthermore, under 
certain conditions on the structure of the switch (including the special case of a memoryless switch), 
we show that the scheme recovers all possible interleaved representations of the original process. 
Experimental results are presented demonstrating that the proposed scheme performs well in practice, 
even for relatively short input samples.

I. Introduction

Problems in applications such as data mining, computer forensics, finance, and genomics often
require the identification of streams of data from different sources, which may be intermingled
or hidden (sometimes purposely) among other unrelated streams, in large interleaved record files.
In this haystack of records can lie buried valuable information whose extraction would be easier
if we were able to separate the contributing streams. The deinterleaving problem studied in this
paper is motivated by these applications (more detailed accounts of which can be found, for
example, in [1], [2], [3]).

In our setting, the data streams, as well as the interleaving agent, will be modeled as
sequences generated by discrete-time random processes over finite alphabets. Specifically, let
$A_1, A_2, \ldots, A_m$ be finite, nonempty, disjoint alphabets, let $A = A_1 \cup A_2 \cup \cdots \cup A_m$, and
$\Pi = \{A_1, A_2, \ldots, A_m\}$. We refer to the $A_i$ as subalphabets, and to $\Pi$ as a partition, of $A$.
Consider $m$ independent, component random processes $P_1, P_2, \ldots, P_m$, defined, respectively,
over $A_1, A_2, \ldots, A_m$, and a random switch process $P_w$ over the alphabet $\Pi$, independent of
the component processes. The interleaved process $P = \mathcal{I}_\Pi(P_1, P_2, \ldots, P_m; P_w)$ is generated as
follows: At each time instant, a subalphabet $A_i \in \Pi$ is selected according to $P_w$, and the next
output sample for $P$ is selected from $A_i$ according to the corresponding process $P_i$ (we say,
loosely, that the switch "selects" $P_i$ at that instant). The component processes $P_i$ are idle when
not selected, i.e., if $P_i$ is selected at time $t$, and next selected at time $t + T$, then the samples
emitted by $P$ at times $t$ and $t + T$ are consecutive emissions from $P_i$, regardless of the length
of the intervening interval $T$.
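The generation rule just described can be illustrated with a minimal Python sketch. It is not part of the scheme studied in this paper: the helper names (`make_markov1`, `interleave`), the restriction to order-one processes, and all transition probabilities are assumptions chosen only for illustration.

```python
import random

def make_markov1(trans, start, rng):
    """Order-1 Markov emitter: trans[s] maps each next symbol to its probability.
    `start` is the initial state, not the first emitted symbol."""
    state = [start]
    def emit():
        symbols = list(trans[state[0]])
        weights = [trans[state[0]][s] for s in symbols]
        state[0] = rng.choices(symbols, weights=weights)[0]
        return state[0]
    return emit

def interleave(components, switch, n):
    """Emit n symbols. At each step the switch picks one component, which
    advances by one symbol; the others stay idle and keep their state."""
    return [components[switch()]() for _ in range(n)]

rng = random.Random(0)  # fixed seed, for reproducibility only
# Hypothetical components over the disjoint subalphabets {a, b} and {x, y}.
p1 = make_markov1({"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.2, "b": 0.8}}, "a", rng)
p2 = make_markov1({"x": {"x": 0.3, "y": 0.7}, "y": {"x": 0.6, "y": 0.4}}, "x", rng)
# Hypothetical order-1 switch over the subalphabet indices 0 and 1.
sw = make_markov1({0: {0: 0.5, 1: 0.5}, 1: {0: 0.7, 1: 0.3}}, 0, rng)
z = interleave([p1, p2], sw, 20)
print("".join(z))
```

Because each component keeps its own state between selections, deleting from the output all symbols outside one subalphabet recovers exactly the sequence that component emitted.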

Given a sample $z^n$ from $P$, and without prior knowledge of the number or the composition
of the subalphabets $A_i$, the deinterleaving problem of interest is to reconstruct the original
sequences emitted by the component processes, and the sequence of switch selections.

So far, we have made two basic assumptions on the structure of the interleaved system: the 
independence of the component and switch processes, and the disjointness of the subalphabets. 
The latter assumption implies that, given an interleaved input stream, identifying the partition
$\Pi$ is equivalent to identifying the component substreams and the sequence of switch selections.
Thus, identifying the partition $\Pi$ is sufficient to solve the deinterleaving problem. Identifying the
substreams when the subalphabets are not disjoint is also a problem of interest, but it appears
more challenging [1], and is outside the scope of this paper. Even with these assumptions, it is
clear that without further restrictions on the component and switch processes, the problem defined 
would be either ill-posed or trivial, since two obvious hypotheses would always be available: 
the interleaved process $P$ could be interpreted as having a single component $P_1 = P$, or as an
interleaving of constant processes over singleton alphabets, driven by a switch $P_w$ essentially
identical to $P$. Therefore, for the problem to be meaningful, some additional constraints must be
posed on the structure of the component and switch processes. In this paper, we study the case
where the components and switch are ergodic finite memory (Markov) processes, i.e., for each
$i \in \{1, 2, \ldots, m, w\}$, there is an integer $k_i \ge 0$ such that for any sufficiently long sequence $u^t$ over
the appropriate alphabet, we have $P_i(u_t | u^{t-1}) = P_i(u_t | u_{t-k_i}^{t-1})$. We assume no knowledge or bound
on the process orders $k_i$, and refer to $P$ in this case as an interleaved Markov process (IMP).
Except for some degenerate cases (e.g., when all the component processes are memoryless), the 
IMP P is generally not a finite memory process, since the interval between consecutive selections 
of a component process is unbounded. Hence, in general, the two obvious hypotheses mentioned 
above are not available, and the deinterleaving problem for IMPs is well-posed, non-trivial, and, 
as we shall show, solvable. 

When $P = \mathcal{I}_\Pi(P_1, P_2, \ldots, P_m; P_w)$ for finite memory processes $P_1, P_2, \ldots, P_m, P_w$, we say
that $\Pi$ is compatible with $P$, and refer to $\mathcal{I}_\Pi(P_1, P_2, \ldots, P_m; P_w)$ also as an IMP representation of
$P$. Notice that, given an IMP $P$, any partition $\Pi'$ of $A$ induces a set of deinterleaved component
and switch processes. In general, however, if $\Pi'$ is the "wrong" partition (i.e., it is incompatible
with $P$), then either some of the induced sub-processes $P'_i$ or $P'_w$ will not be of finite order, or
some of the independence assumptions will be violated. There could, however, be more than one
"right" partition: IMP representations need not be unique, and we may have partitions $\Pi \ne \Pi'$
such that both $\Pi$ and $\Pi'$ are compatible with $P$. We refer to this situation as an ambiguity in
the IMP representation of $P$.$^1$

In this paper, we study IMP ambiguities, derive conditions for uniqueness of IMP
representations, and present a deinterleaving scheme that identifies, eventually almost surely, an IMP
representation of the observed process. Under certain conditions, including all the cases where
the switch is memoryless, the scheme will identify all IMP representations of the process. The
solution is based on finding a partition $\Pi$ of $A$ and an order vector $\mathbf{k} = (k_1, k_2, \ldots, k_m, k_w)$
that minimize a penalized maximum-likelihood (penalized ML) cost function of the form
$C_{\Pi,\mathbf{k}}(z^n) = n\hat{H}_{\Pi,\mathbf{k}}(z^n) + \beta\kappa\log n$, where $\hat{H}_{\Pi,\mathbf{k}}(z^n)$ is the empirical entropy of the observed
sequence $z^n$ under an IMP model induced by $\Pi$ and $\mathbf{k}$, $\kappa$ is the total number of free statistical
parameters in the model, and $\beta$ is a nonnegative constant. Penalized ML estimators of Markov
process order are well known (cf. [4], [5], [6]). Here, we use them to estimate the original
partition $\Pi$, and also the Markov order of the processes $P_i$ and the switch $P_w$.

$^1$ Notice that since $P$ and $\Pi$ uniquely determine the component and switch processes, two different IMP representations of
the same process $P$ must be based on different partitions.

The deinterleaving problem for the special case where all processes involved are of order 
at most one has been previously studied in [1], where an approach was proposed that
could identify an IMP representation of $P$ with high probability as $n \to \infty$ (the approach
as described cannot identify multiple solutions when they exist; instead, all cases leading to
possible ambiguities are excluded using rather coarse conditions). The idea is to run a greedy
sequence of tests, checking equalities and inequalities between various event probabilities (e.g.,
$P(ab) \ne P(a)P(b)$, $P(abc) = P(a)P(b)P(c)$, $a, b, c \in A$), and permanently clustering symbols
into subalphabets sequentially, according to the test results (sequentiality here is with respect to 
the alphabet processing, not the input sequence, which has to be read in full before clustering 
begins). Empirical distributions are used as proxies for the true ones. Clearly, equalities between 
probabilities translate only to "approximate equalities" subject to statistical fluctuations in the 
corresponding empirical quantities, and an appropriate choice of the tolerances used to determine 
equality, as functions of the input length n, is crucial to turn the conceptual scheme into an 
effective algorithm. Specific choices for tolerances are not discussed in [1]. The attractive
feature of the approach in [1] is its low complexity; equipped with a reasonable choice of 
tolerance thresholds, an efficient algorithm for the special case of processes of order one can 
be implemented. However, as we shall see in the sequel, the convergence of the algorithm is 
rather slow in practice, and very long samples are necessary to achieve good deinterleaving 
performance, compared to the schemes proposed here. The problem of deinterleaving hidden-Markov
processes was also studied, mostly experimentally, in [2]. Another variant of the problem,
where all the component processes are assumed to be identical (over the same alphabet), of order
one, and interleaved by a memoryless switch, was studied in [3].

We note that IMPs are a special case of the broader class of switching discrete sources studied
in [7], with variants dating back as early as [8]. However, the emphasis in [7] is on universally
compressing the output of a switched source of known structure, and not on the problem studied
here, which is precisely to identify the source's structure.

The rest of the paper is organized as follows. In Section II we present some additional
definitions and notation, and give a more formal and detailed definition of an IMP, which will be
useful in the subsequent derivations. We also show that an IMP can be represented as a unifilar
finite-state machine (FSM) source (see, e.g., [9]), whose parameters satisfy certain constraints
induced by the IMP structure. In Section III we study conditions for uniqueness of the IMP
representation of a process. We identify two phenomena that may lead to ambiguities: a so-called 
alphabet domination phenomenon which may arise from certain transition probabilities in the 
switch being set to zero (and which, therefore, does not arise in the case of memoryless switches), 
and the presence of memoryless component processes. We derive a set of sufficient conditions 
for uniqueness, and, in cases where ambiguities are due solely to memoryless components (the 
so-called domination-free case, which includes all cases with memoryless switches), characterize 
all the IMP representations of a process $P$. Most of the derivations and proofs for the results of
Section III are presented in Appendix A. In Section IV we present our deinterleaving scheme,
establish its strong consistency, and show that in the domination-free case, it can identify all valid
IMP representations of the interleaved process. The derivations and proofs for these results are
presented in Appendix B. Finally, in Section V we present some experimental results for practical
implementations of deinterleaving schemes. We compare the performance of our scheme with
that of an implementation of the scheme of [1] (with optimized tolerances) for the case of IMPs
with memoryless switches, showing that the ML-based deinterleaver achieves high accuracy rates
in identifying the correct alphabet partition for much shorter sequences than those required by
the scheme of [1]. Our ideal scheme calls for finding the optimal partition through an exhaustive
search, which is computationally expensive. Consequently, we show results for a randomized 
gradient descent heuristic that searches for the same optimal partition. Although in principle 
this approach sacrifices the optimality guarantees of the ideal scheme, in practice, we obtain the 
same results as with exhaustive search, but with a much faster and more practical scheme. We also
present results for IMPs with switches of order one. We show, again, that the ML-based schemes 
exhibit high deinterleaving success rates for sequences as short as a few hundred symbols long, 
and perfect deinterleaving, for the samples tested, for sequences a few thousand symbols long. 



II. Preliminaries 

A. Definitions 

All Markov processes are assumed to be time-homogeneous and ergodic, and, consequently,
to define limiting stationary distributions [10]. We denote the (minimal) order of $P_i$ by
$k_i = \mathrm{ord}(P_i)$, refer to reachable strings $u^{k_i}$ as states of $P_i$, and denote the set of such states by
$\mathcal{S}(P_i)$, $i \in \{1, 2, \ldots, m, w\}$. Some conditional probabilities may be zero, and some $k_i$-tuples
may be non-reachable, but all states are assumed to be reachable and recurrent. We further
assume that all symbols $a \in A$ (and subalphabets $A \in \Pi$) occur infinitely often, and their
stationary marginal probabilities are positive. We make no assumptions on the initial conditions
of each process, and, in our characterization of ambiguities, distinguish processes only up to their
stationary distributions, i.e., we write $P = P'$ if and only if $P$ and $P'$ admit the same stationary
distribution. All probability expressions related to stochastic processes will be interpreted as
(sometimes marginal) stationary probabilities, e.g., $P_i(u)$, or $P_i(a|u) = P_i(ua)/P_i(u)$ when $u$ is
not long enough to define a state of $P_i$. Aside from simplifying some notations, this assumption
makes our results on uniqueness of IMP representations slightly stronger than if we had adopted
a stricter notion of process equivalence (e.g., actual process identity).

For a string $u^t = u_1 u_2 \ldots u_t \in A^t$, let $A_\Pi(u^t) \in \Pi^t$ denote the corresponding string of
subalphabets, i.e., $A_\Pi(u^t)_j = A_i$, where $i$ is the unique index such that $u_j \in A_i \in \Pi$, $1 \le j \le t$.
We sometimes refer to $A_\Pi(u^t)$ also as the switch sequence corresponding to $u^t$. Also, for $A' \subseteq A$,
and a string $u$ over $A$, let $u[A']$ denote the string over $A'$ obtained by deleting from $u$ all symbols
that are not in $A'$. The IMP $P = \mathcal{I}_\Pi(P_1, P_2, \ldots, P_m; P_w)$ is formally defined as follows: Given
$z^t \in A^t$, $t \ge 1$, and assuming $z_t \in A_i$, we have

$$P(z_t | z^{t-1}) = P_w(A_i | A_\Pi(z^{t-1})) \, P_i(z_t | z^{t-1}[A_i]). \qquad (1)$$

It is readily verified that (1) completely defines the process $P$, which inherits whatever initial
conditions hold for the component and switch processes, so that (1) holds for any conditioning
string $z^{t-1}$, $t \ge 1$ (including $z^{t-1} = \lambda$, the empty string). Also, by recursive application of (1), after rearranging
factors, we obtain, for any sequence $z^n \in A^n$,

$$P(z^n) = P_w(A_\Pi(z^n)) \prod_{i=1}^{m} P_i(z^n[A_i]). \qquad (2)$$

Notice that when initial conditions are such that the probabilities on the right-hand side of (2)
are stationary, the equation defines a stationary distribution for $P$. (We adopt the convention that
$P_i(\lambda) = 1$, $i \in \{1, 2, \ldots, m, w\}$, and, consequently, $P(\lambda) = 1$.)
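As an illustration of the switch-sequence map, the restriction of a string to a subalphabet, and the factorization (2), the following Python sketch may help. The function names are hypothetical, and the probability computation is restricted, as a simplifying assumption, to the degenerate case where all components and the switch are memoryless.

```python
from math import prod

def switch_sequence(u, partition):
    """Map each symbol of u to the index of the subalphabet containing it."""
    index = {a: i for i, block in enumerate(partition) for a in block}
    return [index[a] for a in u]

def project(u, subalphabet):
    """Delete from u every symbol that is not in the given subalphabet."""
    return [a for a in u if a in subalphabet]

def memoryless_prob(seq, dist):
    """Probability of seq under an i.i.d. distribution (1 for the empty string)."""
    return prod(dist[s] for s in seq)

def imp_prob(z, partition, comp_dists, switch_dist):
    """Right-hand side of (2), assuming memoryless components and switch."""
    p = memoryless_prob(switch_sequence(z, partition), switch_dist)
    for block, dist in zip(partition, comp_dists):
        p *= memoryless_prob(project(z, block), dist)
    return p

partition = [{"a", "b"}, {"x", "y"}]
z = list("axybbxa")
print(switch_sequence(z, partition))  # [0, 1, 1, 0, 0, 1, 0]
print(project(z, partition[0]))       # ['a', 'b', 'b', 'a']
```

In this memoryless special case the product in (2) reduces to a product of per-symbol probabilities, each equal to the switch probability of the subalphabet times the component probability of the symbol.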

For conciseness, in the sequel, we will sometimes omit the arguments from the notations
$\mathcal{I}_\Pi$ or $\mathcal{I}_{\Pi'}$, assuming that the respective sets of associated subalphabets and processes (resp.
$\{A_i\}$, $\{P_i\}$ or $\{A'_i\}$, $\{P'_i\}$) are clear from the context. For IMP representations $\mathcal{I}_\Pi$ and $\mathcal{I}_{\Pi'}$, we
write $\mathcal{I}_\Pi \equiv \mathcal{I}_{\Pi'}$ if the representations are identical, i.e., $\Pi = \Pi'$ and $P_i = P'_i$, $i \in \{1, 2, \ldots, m, w\}$
(in contrast with the relation $\mathcal{I}_\Pi = \mathcal{I}_{\Pi'}$, which is interpreted to mean that $\mathcal{I}_\Pi$ and $\mathcal{I}_{\Pi'}$ generate
the same process).

We will generally denote sequences (or strings) over $A$ with lower case letters, e.g., $u \in A^*$,
and sequences over $\Pi$ with upper case letters, e.g., $U \in \Pi^*$. We say that $u^n \in A^n$ and $U^n \in \Pi^n$
are consistent if $P(u^n) > 0$ and $U^n = A_\Pi(u^n)$. Clearly, for every sequence $u^n$ with $P(u^n) > 0$
there exists a sequence $U^n = A_\Pi(u^n)$, with $P_w(U^n) > 0$, that is consistent with $u^n$; conversely,
if $P_w(U^n) > 0$, it is straightforward to construct sequences $u^n$ consistent with $U^n$. Unless
specified otherwise, we assume that an upper case-denoted alphabet sequence is consistent with
the corresponding lower case-denoted string, e.g., when we write $UV = A_\Pi(uv)$, we also imply
that $U = A_\Pi(u)$ and $V = A_\Pi(v)$.

B. IMPs and FSM sources 

A finite state machine (FSM) over an alphabet $A$ is defined by a triplet $F = (S, s_0, f)$, where
$S$ is a set of states, $s_0 \in S$ is a (possibly random) initial state, and $f : S \times A \to S$ is a next-state
function. A (unifilar) FSM source (FSMS) is defined by associating a conditional probability
distribution $P_F(\cdot|s)$ with each state $s$ of $F$, and a probability distribution $P_F^{\mathrm{init}}(\cdot)$ on the initial
state $s_0$. To generate a random sequence $x^n$, the source draws $s_0$ according to $P_F^{\mathrm{init}}(\cdot)$ and then
draws, for each $i$, $1 \le i \le n$, a symbol $x_i \in A$ distributed according to $P_F(\cdot|s_{i-1})$, and transitions
to the state $s_i = f(s_{i-1}, x_i)$. Markov sources of order $k$ over $A$ are special cases of FSMSs with
$S = A^k$. We next observe that an IMP can be represented as an FSMS. For convenience, we
will assume in the discussion that FSMSs have arbitrary but fixed initial states. In particular, we
will assume that a fixed initial state $s_0^{(j)} \in \mathcal{S}(P_j)$ is defined for the component/switch processes
$P_j$, $j \in \{1, 2, \ldots, m, w\}$, where we recall that $\mathcal{S}(P_j)$ denotes the state set of $P_j$. The results are
easily generalized to arbitrary initial state conditions, since any initial state distribution can be
written as a convex combination of fixed initial state conditions.

We refer to the vector $\mathbf{k} = (k_1, k_2, \ldots, k_m, k_w)$, where $k_j = \mathrm{ord}(P_j)$, $j \in \{1, 2, \ldots, m, w\}$, as
the order vector of the IMP $\mathcal{I}_\Pi$. We denote by $f_j$ the next-state function of the FSM associated
with $P_j$, $j \in \{1, 2, \ldots, m, w\}$, and define the initial state vector $\mathbf{s}_0 = (s_0^{(1)}, s_0^{(2)}, \ldots, s_0^{(m)}, s_0^{(w)})$.
We consider now an FSM $F_{\Pi,\mathbf{k}} = (S, \mathbf{s}_0, f)$, with state set $S = S_1 \times S_2 \times \cdots \times S_m \times S_w$, and next-state
function $f$ defined as follows: Given a state $\mathbf{s} = (s^{(1)}, s^{(2)}, \ldots, s^{(m)}, s^{(w)}) \in S$, and $a \in A$
such that $A_\Pi(a) = A_i$, we have $f(\mathbf{s}, a) = \mathbf{s}' = (s'^{(1)}, s'^{(2)}, \ldots, s'^{(m)}, s'^{(w)})$, where $s'^{(j)} = s^{(j)}$ for
$j \in \{1, 2, \ldots, m\} \setminus \{i\}$, $s'^{(i)} = f_i(s^{(i)}, a)$, and $s'^{(w)} = f_w(s^{(w)}, A_i)$. To complete the definition
of an FSMS, for each state $\mathbf{s} \in S$, we define the conditional probability distribution

$$P_{\Pi,\mathbf{k}}(a|\mathbf{s}) = P_w(A_i | s^{(w)}) \, P_i(a | s^{(i)}), \quad a \in A, \ A_\Pi(a) = A_i \in \Pi. \qquad (3)$$

The following proposition is readily verified. 

Proposition 1: $F_{\Pi,\mathbf{k}}$, with transition probabilities $P_{\Pi,\mathbf{k}}$, generates $P = \mathcal{I}_\Pi(P_1, P_2, \ldots, P_m; P_w)$.
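The next-state function of the product FSM described above can be sketched in Python as follows. This is only an illustration under stated assumptions: states are represented as tuples of the most recent symbols, subalphabets are referred to by their index in the partition, and the names `markov_next` and `imp_next_state` are hypothetical.

```python
def markov_next(state, symbol, order):
    """Next state of an order-k Markov FSM: keep the last k symbols."""
    return (state + (symbol,))[-order:] if order > 0 else ()

def imp_next_state(state, a, partition, orders):
    """Product-FSM transition on symbol a.

    state  = (s_1, ..., s_m, s_w), one tuple per component plus the switch.
    orders = (k_1, ..., k_m, k_w).
    Only the selected component and the switch advance; the rest are unchanged.
    """
    i = next(j for j, block in enumerate(partition) if a in block)
    new = list(state)
    new[i] = markov_next(state[i], a, orders[i])
    new[-1] = markov_next(state[-1], i, orders[-1])
    return tuple(new)

partition = [{"a", "b"}, {"x", "y"}]
orders = (1, 2, 1)
s = (("a",), ("x", "y"), (0,))
print(imp_next_state(s, "x", partition, orders))
# (('a',), ('y', 'x'), (1,))
```

The state of the first component is untouched by the symbol "x", which mirrors the fact that unselected components remain idle.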

Results analogous to Proposition 1 for switching discrete sources are given in [7]. The class
of finite state sources considered in [7], however, is broader, as unifilarity is not assumed.

It follows from the ergodicity and independence assumptions for IMP components and switch
that $P$ is an ergodic FSMS, and every state $\mathbf{s} \in S$ has a positive stationary probability. Let
$\alpha_i = |A_i|$, $1 \le i \le m$, and $\alpha = |A| = \sum_{i=1}^{m} \alpha_i$. By the definition of the state set $S$, we
have $|S| \le m^{k_w} \prod_{i=1}^{m} \alpha_i^{k_i}$ (equality holding when all $k_j$-tuples over the appropriate alphabet are
reachable states of $P_j$, $j \in \{1, 2, \ldots, m, w\}$). Hence, the class of arbitrary FSMSs over $A$, with
underlying FSM $F_{\Pi,\mathbf{k}}$, would have, in general, up to

$$\mathcal{K}(\Pi, \mathbf{k}) = (\alpha - 1) \, m^{k_w} \prod_{i=1}^{m} \alpha_i^{k_i} \qquad (4)$$

free statistical parameters. The conditional probability distributions in (3), however, are highly
constrained, as the parameters $P_{\Pi,\mathbf{k}}(a|\mathbf{s})$ satisfy relations of the form

$$P_w(A_i | s'^{(w)}) \, P_{\Pi,\mathbf{k}}(a|\mathbf{s}) = P_w(A_i | s^{(w)}) \, P_{\Pi,\mathbf{k}}(a|\mathbf{s}'),$$

where $A_i = A_\Pi(a)$, for all states $\mathbf{s}'$ such that $s^{(i)} = s'^{(i)}$. In particular, it follows directly from (3)
that $P_{\Pi,\mathbf{k}}(a|\mathbf{s}) = P_{\Pi,\mathbf{k}}(a|\mathbf{s}')$ if $s^{(i)} = s'^{(i)}$ and $s^{(w)} = s'^{(w)}$. Overall, the number of free parameters
remains, of course, that of the original component Markov processes and switch, i.e., up to

$$\kappa(\Pi, \mathbf{k}) = \sum_{i=1}^{m} \alpha_i^{k_i} (\alpha_i - 1) + (m - 1) \, m^{k_w}, \qquad (5)$$

which is generally (much) smaller than $\mathcal{K}(\Pi, \mathbf{k})$.

We refer to an FSMS satisfying the constraints implicit in (3) as an IMP-constrained FSMS.
Notice that, given a specific IMP $P = \mathcal{I}_\Pi(P_1, P_2, \ldots, P_m; P_w)$, the associated FSM $F_{\Pi,\mathbf{k}}$ may
also incorporate "hard constraints" on the parameters of (maybe other) FSMSs based on $F_{\Pi,\mathbf{k}}$,
due to some $k_j$-tuples possibly being non-reachable in $P_j$, and the corresponding transition
probabilities being identically zero. Later on, when our task is to estimate the FSMS without
any prior knowledge on the structure of $P$, we will assume that candidate structures $F_{\Pi,\mathbf{k}}$ are
fully parametrized, i.e., the class of IMP-constrained FSMSs generated by $F_{\Pi,\mathbf{k}}$ has exactly $\kappa$
free statistical parameters (we omit the arguments of $\mathcal{K}$ and $\kappa$ when clear from the context).
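To give a feeling for the gap between the parameter counts in (4) and (5), the two formulas can be evaluated directly. The short Python sketch below does so for a hypothetical configuration (two subalphabets of sizes 2 and 3, all orders equal to one); the function names are illustrative only.

```python
from math import prod

def fsms_param_count(alpha, k, kw):
    """Eq. (4): free parameters of an unconstrained FSMS on the product FSM."""
    m = len(alpha)
    return (sum(alpha) - 1) * m**kw * prod(a**ki for a, ki in zip(alpha, k))

def imp_param_count(alpha, k, kw):
    """Eq. (5): free parameters of the component processes plus the switch."""
    m = len(alpha)
    return sum(a**ki * (a - 1) for a, ki in zip(alpha, k)) + (m - 1) * m**kw

alpha, k, kw = [2, 3], [1, 1], 1   # subalphabet sizes, component orders, switch order
print(fsms_param_count(alpha, k, kw))  # 48
print(imp_param_count(alpha, k, kw))   # 10
```

Even in this small case the IMP constraints reduce the number of free parameters from 48 to 10, and the gap widens quickly with the orders and alphabet sizes.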

III. Uniqueness of IMP representations 

In this section, we study conditions under which the IMP representation of a process is 
unique, and, for IMPs that are free from certain "pathologies" that will be discussed in the 
sequel, characterize all IMP representations of a process when multiple ones exist. Notice that
although, as shown in Section II, IMPs can be represented as constrained FSM sources, the
study of ambiguities of IMP representations differs from the problem of characterizing different
FSM representations of a source [11], or more generally of representations of hidden Markov
processes [12]. It is known [11] that all FSMs that can generate a given FSMS $P$ are refinements$^2$
of a so-called minimal FSM representation of the source. In particular, this applies to the
FSM corresponding to any IMP representation. However, the minimal FSM representation is
not required to satisfy the IMP constraints, so it need not coincide with a minimal (or unique)
IMP representation. Notice also that, when defining IMPs and their FSM representations, we
have assumed that the orders $k_j$ of all the Markov processes involved are minimal, thus excluding
obvious FSM refinements resulting from refining some of the individual Markov processes. 

A. Alphabet domination 

Let $A$, $B$ be arbitrary subalphabets in $\Pi$. We say that $A$ dominates $B$ (relative to $P_w$) if there
exists a positive integer $M$ such that if $P_w$ has emitted $M$ occurrences of $B$ without emitting
one of $A$, then with probability one $P_w$ will emit an occurrence of $A$ before it emits another
occurrence of $B$; in other words, if $P_w(U) > 0$, then $U[\{A, B\}]$ does not contain any run of
more than $M$ consecutive occurrences of $B$. We denote the domination relation of $A$ over $B$ as
$A \sqsupset B$, dependence on $P_w$ being understood from the context; when $A$ does not dominate $B$,
we write $A \not\sqsupset B$ (thus, for example, $A \not\sqsupset A$). We say that $A$ is dominant (in $\Pi$, relative to $P_w$)
if either $m = 1$ (i.e., $\Pi = \{A\}$) or $A \sqsupset B$ for some $B \in \Pi$, and that $A$ is totally dominant if
either $m = 1$ or $A \sqsupset B$ for all $B \in \Pi \setminus \{A\}$. If $A \sqsupset B$ and $B \sqsupset A$, we say that $A$ and $B$ are
in mutual domination, and write $A \sqsupset\!\sqsubset B$. It is readily verified that domination is an irreflexive
transitive relation. When no two subalphabets are in mutual domination, the relation defines a
strict partial order (see, e.g., [14]) on the finite set $\Pi$. We shall make use of the properties of
this strict partial order in the sequel.

$^2$ A refinement [13] of an FSM $F = (S, s_0, f)$ is an FSM $F^+ = (S^+, s_0^+, f^+)$ such that for some fixed function $g : S^+ \to S$
and any sequence $x^n$, the respective state sequences $\{s_i\}$ and $\{s_i^+\}$ satisfy $s_i = g(s_i^+)$, $0 \le i \le n$ (for example, the FSM
underlying a Markov process of order $k + 1$ is a refinement of the FSM underlying one of order $k$). By suitable choices of
conditional probabilities, a refinement of $F$ can generate any process that $F$ can generate.

Fig. 1. A switch $P_w$ of order two over $\Pi = \{A, B, C\}$. Arcs are labeled $X : \xi$, where $X$ is the emitted symbol and $\xi$ the
corresponding transition probability. Transitions not drawn are assumed to have probability zero.

Domination can occur only if some transition probabilities in $P_w$ are zero; therefore, it never
occurs when $P_w$ is memoryless. The approach for ord($P_w$) = 1 in [1] assumes that $P_w(A|A) > 0$
for all $A \in \Pi$. Clearly, this precludes alphabet domination. However, the condition is more stringent
than necessary, both for that purpose and as a condition for uniqueness.
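For intuition, domination can be checked mechanically in the simplest case of a switch of order one, where switch states coincide with subalphabets. The Python sketch below is an illustration and not part of the results of this paper; it relies on the observation (an assumption of the sketch, valid for order-one switches in which every state is reachable and recurrent) that $A$ dominates $B$ exactly when $B$ lies on no cycle of positive-probability transitions that avoids $A$. The function name and the example switch are hypothetical.

```python
def dominates(A, B, positive):
    """Return True if A dominates B for an order-1 switch.

    positive[X] is the set of subalphabets Y with P_w(Y | X) > 0.
    B can recur arbitrarily often without an intervening A exactly when
    B lies on a cycle of the positive-transition graph that avoids A.
    """
    if A == B:
        return False  # domination is irreflexive
    stack = [Y for Y in positive[B] if Y != A]
    seen = set()
    while stack:
        X = stack.pop()
        if X == B:
            return False  # found a cycle through B that avoids A
        if X not in seen:
            seen.add(X)
            stack.extend(Y for Y in positive[X] if Y != A)
    return True

# Hypothetical order-1 switch: B is always followed by A; C by A or B.
positive = {"A": {"B", "C"}, "B": {"A"}, "C": {"A", "B"}}
print(dominates("A", "B", positive))  # True
print(dominates("A", "C", positive))  # True
print(dominates("B", "A", positive))  # False
```

In this example $A$ is totally dominant. If every self-transition has positive probability, as in the assumption $P_w(A|A) > 0$ mentioned above, each subalphabet lies on a self-loop and the function returns False for every pair.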

Example 1: Consider an IMP $P = \mathcal{I}_\Pi(P_1, P_2, P_3; P_w)$ with $\Pi = \{A, B, C\}$, and $P_w$ as defined
by Figure 1, where ord($P_w$) = 2, and transitions are labeled with their respective emitted symbols
and probabilities. We assume that $\mu \in (0, 1]$ and $\rho \in (0, 1)$. For this switch, we have $A \sqsupset B$,
$A \sqsupset C$, and $B \sqsupset\!\sqsubset C$; $A$ is totally dominant, and, if $\rho < 1$, it is not dominated. If $\rho = 1$, every
pair of subalphabets is in mutual domination. In all cases, $P_w$ is aperiodic.

Fig. 2. Switches for ambiguous IMP representation: (a) $P_w$ over $\{C, D\}$, ord($P_w$) = 1 ($C = A \cup B$, and the internal structure
of $P_C$ is also shown); (b) $P'_w$ over $\{A, B, D\}$, ord($P'_w$) = 2. Arcs are labeled with their corresponding emitted symbols and
transition probabilities; transitions not shown have probability zero.



B. Conditions for uniqueness 

We derive sufficient conditions for the uniqueness of IMP representations, and show how
ambiguities may arise when the conditions are not satisfied. The main result of the subsection
is given in the following theorem, whose derivation and proof are deferred to Appendix A-A.

Theorem 1: Assume that, for an IMP $P = \mathcal{I}_\Pi(P_1, P_2, \ldots, P_m; P_w)$,

i) no two subalphabets in $\Pi$ are in mutual domination,

ii) no subalphabet in $\Pi$ is totally dominant, and

iii) none of the processes $P_i$ is memoryless.

Then, if $P = \mathcal{I}_{\Pi'}(P'_1, P'_2, \ldots, P'_{m'}; P'_w)$ for some partition $\Pi'$ and finite memory processes
$P'_1, P'_2, \ldots, P'_{m'}, P'_w$, we must have $\mathcal{I}_\Pi \equiv \mathcal{I}_{\Pi'}$.

Example 2: We consider alphabets $A$, $B$, $D$, and $C = A \cup B$, and respective associated
processes $P_A$, $P_B$, $P_D$, $P_C$. Part (a) of Fig. 2 shows a switch $P_w$ of order 1 over $\Pi = \{C, D\}$.
Here, $P_C$ is in itself an interleaved process over $\{A, B\}$, with components $P_A$ and $P_B$ and an
internal switch of its own (shown in the figure), with $P_B$ chosen as a
memoryless process so that $P_C$ has finite memory (specifically, ord($P_C$) $\le$ 2 ord($P_A$)); $P_D$ is
not memoryless, and we have $\nu, \mu \in (0, 1)$. Part (b) shows a switch $P'_w$ of order two over
$\Pi' = \{A, B, D\}$. State $*A$ (resp. $*B$) represents all states that end in $A$ (resp. $B$). It is readily
verified that $P = \mathcal{I}_\Pi(P_C, P_D; P_w) = \mathcal{I}_{\Pi'}(P_A, P_B, P_D; P'_w)$, so $P$ is an ambiguous IMP. It is also
readily verified that both $\mathcal{I}_\Pi$ and $\mathcal{I}_{\Pi'}$ violate Condition (ii) of Theorem 1: $C$ is totally dominant
in $\mathcal{I}_\Pi$, and $A$ is totally dominant in $\mathcal{I}_{\Pi'}$. In fact, the figure exemplifies a more detailed variant of
Theorem 1, presented as Theorem 2 below, which characterizes ambiguities when Condition (ii)
of the original theorem is removed.

Given partitions $\Pi$ and $\Pi'$ of $A$, we say that $A_i \in \Pi$ splits in $\Pi'$ if $A_i$ is partitioned into
subalphabets in $\Pi'$, i.e., $A'_j \subseteq A_i$ for all $A'_j \in \Pi'$ such that $A'_j \cap A_i \ne \emptyset$.

Theorem 2: Let $\Pi = \{A_1, A_2, \ldots, A_m\}$ be a partition of $A$, and consider an IMP
representation $P = \mathcal{I}_\Pi(P_1, P_2, \ldots, P_m; P_w)$ such that no two subalphabets are in mutual domination,
and none of the processes $P_i$ is memoryless. Then, if $P = \mathcal{I}_{\Pi'}(P'_1, P'_2, \ldots, P'_{m'}; P'_w)$ for some
partition $\Pi' = \{A'_1, A'_2, \ldots, A'_{m'}\}$ of $A$, we must have $A_i \in \Pi'$ for all subalphabets $A_i \in \Pi$
except possibly for one subalphabet $A_{i_0} \in \Pi$, which must be totally dominant and split in $\Pi'$.

The proof of Theorem 2 is also deferred to Appendix A-A. The theorem covers the special
case $m = 1$, which is excluded by Condition (ii) in Theorem 1. In this case, the IMP is actually
a finite-memory process, which admits the two "obvious" IMP representations (with $m = 1$ and
$m = |A| = |A_1|$, respectively) mentioned in the introduction.

C. Ambiguities due to memoryless components in the domination-free case 

In this subsection, we eliminate Condition (iii) of Theorem 1 while strengthening
Conditions (i) and (ii) by excluding all forms of alphabet domination. We characterize all the
representations of an IMP when ambiguities, if any, are due solely to memoryless components.

We say that a partition $\Pi'$ is a refinement of $\Pi$ if every subalphabet $A_i \in \Pi$ splits in $\Pi'$. When
$\Pi'$ is a refinement of $\Pi$, we define the function $\Psi_{\Pi,\Pi'} : \Pi' \to \Pi$ mapping a subalphabet $A'_j \in \Pi'$
to the subalphabet $A_i \in \Pi$ that contains it. The notation and map extend in the natural way to
arbitrary strings, namely $\Psi_{\Pi,\Pi'} : (\Pi')^k \to \Pi^k$ for all $k > 0$. We will omit the indices $\Pi, \Pi'$ from
$\Psi$ when clear from the context.

Lemma 1: Consider a partition $\Pi = \{A_1, A_2, \ldots, A_m\}$, together with a refinement
$\Pi' = \{B_1, B_2, A_2, \ldots, A_m\}$ of $\Pi$ (i.e., $A_1 = B_1 \cup B_2$). Let $P = \mathcal{I}_\Pi(P_1, P_2, \ldots, P_m; P_w)$, where $P_1$
is memoryless, and let $P' = \mathcal{I}_{\Pi'}(P_1^{(1)}, P_1^{(2)}, P_2, \ldots, P_m; P'_w)$, where both $P_1^{(1)}$ and $P_1^{(2)}$ are
memoryless. Then, $P = P'$ if and only if the following conditions hold:

$$P_1^{(j)}(b) = \frac{P_1(b)}{P_1(B_j)}, \quad b \in B_j, \ j \in \{1, 2\}, \qquad (6)$$

$$\mathcal{S}(P'_w) = \{ S' \in (\Pi')^{k_w} \mid \Psi(S') \in \mathcal{S}(P_w) \}, \qquad (7)$$

and for all $A \in \Pi'$ and $S' \in \mathcal{S}(P'_w)$, with $S = \Psi(S')$,

$$P'_w(A|S') = \begin{cases} P_w(A|S), & A = A_i, \ i \ge 2, \\ P_w(A_1|S) \, P_1(B_j), & A = B_j, \ j = 1, 2. \end{cases} \qquad (8)$$


Remarks. The proof of Lemma 1 is deferred to Appendix A-B. The lemma is interpreted as
follows: since, given $\mathcal{I}_\Pi$, processes $P_1^{(1)}$, $P_1^{(2)}$, and $P'_w$ can always be defined to satisfy (6)-(8),
an IMP $P$ with a nontrivial memoryless component always admits alternative representations
where the alphabet associated with the memoryless process has been split into disjoint parts (the
split may be into more than two parts, if the lemma is applied repeatedly). We refer to such
representations as memoryless refinements of the original representation $\mathcal{I}_\Pi$. Using the lemma
repeatedly, we conclude that $P$ admits a refinement where all the memoryless components are
defined over singleton alphabets. On the other hand, the memoryless components $P_1^{(1)}$ and $P_1^{(2)}$
of $P'$ can be merged if and only if $P'_w$ satisfies the constraint

$$P'_w(B_2|S') = \gamma \, P'_w(B_1|S') \qquad (9)$$

for a constant $\gamma$ independent of $S' \in \mathcal{S}(P'_w)$. Indeed, when (9) holds, we set
$P_1(B_1) = 1/(1 + \gamma)$ and $P_1(B_2) = \gamma/(1 + \gamma)$, and $P_1$, $P_w$ are defined implicitly by (6)-(8). Notice that the
constraint (9) is trivially satisfied when the switch $P'_w$ is memoryless (and so is also the resulting
$P_w$). Thus, in this case, memoryless component processes can be split or merged arbitrarily
to produce alternative IMP representations. When the switch has memory, splitting is always
possible, but merging is conditioned on (9). We refer to a representation where no more mergers
of memoryless processes are possible, as well as to the corresponding partition $\Pi$, as canonical
(clearly, the canonicity of $\Pi$ is relative to the given IMP).$^3$
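The split described by (6) and (8) is easy to check numerically in the simplest setting. The Python sketch below assumes, for illustration only, a memoryless switch, so that the conditioning state in (8) disappears; the function name and the numerical values are hypothetical.

```python
def split_memoryless(p1, block1, block2, pw_A1):
    """Split a memoryless component P_1 over A_1 = block1 union block2.

    p1     : dict symbol -> probability under P_1
    pw_A1  : probability that a memoryless switch selects A_1
    Returns the two renormalized components, as in (6), and the new switch
    probabilities of B_1 and B_2, as in (8).
    """
    mass1 = sum(p1[b] for b in block1)          # P_1(B_1)
    mass2 = sum(p1[b] for b in block2)          # P_1(B_2)
    q1 = {b: p1[b] / mass1 for b in block1}
    q2 = {b: p1[b] / mass2 for b in block2}
    return q1, q2, pw_A1 * mass1, pw_A1 * mass2

p1 = {"a": 0.5, "b": 0.3, "c": 0.2}
q1, q2, pw_B1, pw_B2 = split_memoryless(p1, {"a"}, {"b", "c"}, pw_A1=0.4)
# The per-symbol probability is unchanged by the split:
# before: P_w(A_1) * P_1(b); after: P'_w(B_2) * P_1^(2)(b).
print(0.4 * p1["b"], pw_B2 * q2["b"])
```

Both printed values agree up to floating-point rounding, which is the memoryless-switch instance of the statement that the split representation generates the same process.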

We denote the canonical representation associated with an IMP $P = \mathcal{I}_\Pi$ by $(\mathcal{I}_\Pi)^*$, and the
corresponding canonical partition by $(\Pi)^*_P$. Also, we say $P$ is domination-free if there is no
alphabet domination in any IMP representation of $P$. The main result of the subsection is given
in the theorem below, whose proof is presented in Appendix A-B.

Theorem 3: Let $P = \mathcal{I}_\Pi$ and $P' = \mathcal{I}_{\Pi'}$ be domination-free IMPs over $A$. Then, $P = P'$ if
and only if $(\mathcal{I}_\Pi)^* \equiv (\mathcal{I}_{\Pi'})^*$.

3 The particular case of this result for IMPs with memoryless switches discussed in [15] uses a
slightly different definition of canonicity.
Theorem 3 implies that, in the domination-free case, all the IMP representations of a process
are those constructible by sequences of the splits and mergers allowed by Lemma 1. In particular,
this always applies to the case of memoryless switches, where domination does not arise.

Corollary 1: Let $P = \mathcal{I}_\Pi$ and $P' = \mathcal{I}_{\Pi'}$ be IMPs over $A$, where the switches
$P_w$ and $P'_w$ are memoryless. Then, $P = P'$ if and only if $(\mathcal{I}_\Pi)^* = (\mathcal{I}_{\Pi'})^*$.

IV. The deinterleaving scheme 

Given any finite alphabet $A$, a sequence $u^\ell \in A^\ell$, and a nonnegative integer $k$, denote by
$\hat{H}_k(u^\ell)$ the $k$th order (unnormalized) empirical entropy of $u^\ell$, namely,
$\hat{H}_k(u^\ell) = -\log \hat{P}_k(u^\ell)$, where $\hat{P}_k(u^\ell)$ is the ML (or empirical)
probability of $u^\ell$ under a $k$th order Markov model with a fixed initial state. Let $z^n$ be a
sequence over $A$. An arbitrary partition $\Pi$ of $A$ naturally defines a deinterleaving of $z^n$
into sub-sequences $z_i = z^n[A_i]$, $1 \le i \le m$, with a switch sequence $Z_w = A_\Pi(z^n)$.
Given, additionally, an order vector $\mathbf{k} = (k_1, k_2, \ldots, k_m, k_w)$, we define

$$\hat{H}_{\Pi,\mathbf{k}}(z^n) = \sum_{i=1}^{m} \hat{H}_{k_i}(z_i) + \hat{H}_{k_w}(Z_w).$$

This quantity can be regarded as the (unnormalized) empirical entropy of $z^n$ with respect to
$F = F_{\Pi,\mathbf{k}}$ for an IMP-constrained FSMS (as discussed in Subsection II-B). Indeed, let
$\hat{P}_{\Pi,\mathbf{k}}(z^n)$ denote the ML probability of $z^n$ with respect to $F$ under IMP
constraints, i.e., denoting by $\mathcal{P}_{\mathcal{I}}(F_{\Pi,\mathbf{k}})$ the class of all IMPs
generated by $F$ (i.e., all FSMSs based on $F$ with parameter vectors satisfying the IMP
constraints), we have

$$\hat{P}_{\Pi,\mathbf{k}}(z^n) = \max_{P \in \mathcal{P}_{\mathcal{I}}(F_{\Pi,\mathbf{k}})} P(z^n). \qquad (10)$$

Clearly, by (1), $\hat{P}_{\Pi,\mathbf{k}}(z^n)$ is obtained by maximizing, independently, the
probabilities of the component and switch sequences derived from $z^n$, and, thus, we have
$\hat{H}_{\Pi,\mathbf{k}}(z^n) = -\log \hat{P}_{\Pi,\mathbf{k}}(z^n)$. Notice that
$\hat{P}_{\Pi,\mathbf{k}}(z^n)$ is generally different from (and upper-bounded by) the ML probability
with respect to $F$ for an unconstrained FSMS; this ML probability will be denoted
$\hat{P}_F(z^n)$. Next, we define the penalized cost of $z^n$ relative to $\Pi$ and $\mathbf{k}$ as

$$C_{\Pi,\mathbf{k}}(z^n) = \hat{H}_{\Pi,\mathbf{k}}(z^n) + \beta\kappa \log(n+1), \qquad (11)$$
where $\kappa = \kappa(\Pi, \mathbf{k})$, as given in (5), is the number of free statistical parameters
in a generic IMP-constrained FSMS based on $F$, and $\beta$ is a nonnegative (penalization) constant.^4
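The penalized cost of a given partition and order vector can be sketched in a few lines. The sketch below makes several assumptions that are not taken from the text: logarithms are base 2, the fixed initial state is modeled by a padding symbol `None`, and the parameter count $\kappa$ is assumed to be $\sum_i \alpha_i^{k_i}(\alpha_i - 1) + m^{k_w}(m-1)$, with $\alpha_i = |A_i|$. All function names are illustrative.

```python
from collections import Counter
from math import log2

def empirical_entropy(seq, k):
    """Unnormalized k-th order empirical entropy, -log2 of the ML probability.
    Assumption: the fixed initial state is modeled by k padding symbols."""
    padded = [None] * k + list(seq)
    ctx_counts, pair_counts = Counter(), Counter()
    for i in range(k, len(padded)):
        ctx = tuple(padded[i - k:i])
        ctx_counts[ctx] += 1
        pair_counts[(ctx, padded[i])] += 1
    return -sum(c * log2(c / ctx_counts[ctx])
                for (ctx, _), c in pair_counts.items())

def deinterleave(z, partition):
    """Split z into the component sub-sequences and the switch sequence."""
    block_of = {a: i for i, block in enumerate(partition) for a in block}
    subs = [[] for _ in partition]
    switch = []
    for a in z:
        subs[block_of[a]].append(a)
        switch.append(block_of[a])
    return subs, switch

def num_params(partition, orders):
    """Assumed form of kappa (see lead-in); orders = (k_1, ..., k_m, k_w)."""
    *ks, kw = orders
    m = len(partition)
    comp = sum(len(b) ** k * (len(b) - 1) for b, k in zip(partition, ks))
    return comp + m ** kw * (m - 1)

def penalized_cost(z, partition, orders, beta=0.5):
    """Empirical entropy of the deinterleaved streams plus beta*kappa*log(n+1)."""
    subs, switch = deinterleave(z, partition)
    *ks, kw = orders
    entropy = sum(empirical_entropy(s, k) for s, k in zip(subs, ks))
    entropy += empirical_entropy(switch, kw)
    return entropy + beta * num_params(partition, orders) * log2(len(z) + 1)
```

For example, `penalized_cost("acbd", [{"a", "b"}, {"c", "d"}], (0, 0, 0))` adds the zero-order entropies of the streams `ab`, `cd`, and the switch sequence `0101` to a penalty for three free parameters.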

Given a sample $z^n$ from an IMP $P$, our deinterleaving scheme estimates a partition
$\hat{\Pi}(z^n)$, and an order vector $\hat{\mathbf{k}}(z^n)$, for the estimated IMP representation of
$P$. The desired estimates are obtained by the following rule:

$$\left(\hat{\Pi}(z^n), \hat{\mathbf{k}}(z^n)\right) = \arg\min_{(\Pi',\mathbf{k}')} C_{\Pi',\mathbf{k}'}(z^n), \qquad (12)$$

where $(\Pi', \mathbf{k}')$ ranges over all pairs of partitions of $A$ and order vectors $\mathbf{k}'$. In
the minimization, if $C_{\Pi',\mathbf{k}'}(z^n) = C_{\Pi'',\mathbf{k}''}(z^n)$ for different pairs
$(\Pi', \mathbf{k}')$ and $(\Pi'', \mathbf{k}'')$, the tie is broken first in favor of the partition with
the smallest number of alphabets. Notice that although the search space in (12) is defined as a
Cartesian product, once a partition $\Pi'$ is chosen, the optimal process orders $k'_j$ are
determined independently for each $j \in \{1, 2, \ldots, m, w\}$, in a conventional penalized ML
Markov order estimation procedure (see, e.g., [6]). Also, it is easy to verify that the optimal
orders $k'_j$ must be $O(\log n)$, reducing the search space for $\mathbf{k}'$ in (12).
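The rule (12) can be illustrated with a brute-force search over all set partitions, choosing the order of each stream independently once the partition is fixed. This is a toy sketch under stated assumptions, not the authors' implementation: logarithms are base 2, the alphabet is taken to be the set of symbols observed in the sample, orders are capped at a hypothetical `max_k`, the per-stream parameter count is assumed to be $\alpha^k(\alpha-1)$, and ties are resolved only by the number of subalphabets. The number of partitions grows as the Bell numbers, so this is practical only for very small alphabets.

```python
from collections import Counter
from math import log2

def emp_entropy(seq, k):
    """Unnormalized k-th order empirical entropy (padding models the initial state)."""
    padded = [None] * k + list(seq)
    ctx, pair = Counter(), Counter()
    for i in range(k, len(padded)):
        c = tuple(padded[i - k:i])
        ctx[c] += 1
        pair[(c, padded[i])] += 1
    return -sum(n * log2(n / ctx[c]) for (c, _), n in pair.items())

def best_order(seq, alpha, penalty, max_k):
    """Penalized-ML order estimate for one stream; returns (cost, order)."""
    return min((emp_entropy(seq, k) + penalty * alpha ** k * (alpha - 1), k)
               for k in range(max_k + 1))

def set_partitions(items):
    """Yield every partition of the list `items` as a list of lists."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def deinterleave_ml(z, beta=0.5, max_k=2):
    """Exhaustive minimization of the penalized cost over partitions and orders."""
    alphabet = sorted(set(z))
    penalty = beta * log2(len(z) + 1)
    best = None
    for part in set_partitions(alphabet):
        block_of = {a: i for i, b in enumerate(part) for a in b}
        streams = [[a for a in z if block_of[a] == i] for i in range(len(part))]
        switch = [block_of[a] for a in z]
        cost = sum(best_order(s, len(b), penalty, max_k)[0]
                   for s, b in zip(streams, part))
        cost += best_order(switch, len(part), penalty, max_k)[0]
        key = (cost, len(part))  # on equal cost, prefer fewer subalphabets
        if best is None or key < best[0]:
            best = (key, part)
    return best[1], best[0][0]
```

Because the cost is a sum over streams, the inner minimization over orders decouples exactly as noted in the text, which is why `best_order` is called once per stream rather than over the full Cartesian product of order vectors.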



Our main result is given by the following theorem, whose derivation and proof are presented
in Appendix B. Recall that $(\Pi)^*_P$ denotes the canonical partition of $P$ (Subsection III-C).

Theorem 4: Let $P = \mathcal{I}_\Pi(P_1, P_2, \ldots, P_m; P_w)$, and let $z^n$ be a sample from $P$.
Then, for suitable choices of the penalization constant $\beta$, $\hat{\Pi}(z^n)$ is compatible with
$P$, and $\hat{\mathbf{k}}(z^n)$ reproduces the order vector of the corresponding IMP representation
$\mathcal{I}_{\hat{\Pi}}$, almost surely as $n \to \infty$. Furthermore, if $P$ is domination-free, we have

$$\hat{\Pi}(z^n) = (\Pi)^*_P \quad \text{a.s. as } n \to \infty.$$

4 For convenience, we set the penalty terms in (11) all proportional to $\log(n+1)$, rather than the
term corresponding to $z_i$ being proportional to $\log |z_i|$. Given our basic assumptions on switch
processes, if $z^n$ is a sample from an IMP, $|z_i|$ will, almost surely, be proportional to $n$.
Therefore, the simpler definition adopted has no effect on the main asymptotic results. Clearly,
using $\log(n+1)$ in lieu of $\log n$, which will be convenient in some derivations, is also of
negligible effect.

Remarks.

• Theorem 4 states that our scheme, when presented with a sample from an interleaved
process, will almost surely recover an alphabet partition compatible with the process. If the
interleaved process is domination-free, the scheme will recover the canonical partition of the
process, from which all compatible partitions can be generated via repeated applications of
Lemma 1. The difficulty in establishing the first claim of the theorem resides in the size of the
class of models that participate in the optimization (12). The fact that a compatible partition
will prevail over any specific incompatible one eventually almost surely, for any penalization
coefficient $\beta \ge 0$, will be readily established through a large deviations argument. However,
the class contains models whose size is not bounded with $n$. In fact, it is well known that the
stationary distribution of the ergodic process $P$ can be approximated arbitrarily (in the
entropy sense) by finite memory processes of unbounded order. Thus, without appropriately
penalizing the model size, a sequence of "single stream" hypotheses of unbounded order can get
arbitrarily close in cost to the partitions compatible with $P$. We will prove that an
appropriate positive value of $\beta$ suffices to rule out these large models that asymptotically
approach $P$. To establish the second claim of the theorem, we will take advantage of the
observation that the canonical representation of a domination-free IMP is, in a sense, also the
most "economical". Indeed, comparing the number of free statistical parameters in the two IMP
representations considered in Lemma 1, we obtain, using (5),

$$\kappa(\Pi', \mathbf{k}') - \kappa(\Pi, \mathbf{k}) = m(m+1)^{k_w} - (m-1)m^{k_w} - 1. \qquad (13)$$

It is readily verified that the expression on the right-hand side of (13) vanishes for $k_w = 0$,
and is strictly positive when $k_w > 0$ (since $m > 1$). Therefore, splitting a memoryless
component as allowed by Lemma 1, in general, can only increase the number of parameters.
Thus, the canonical partition minimizes the model size, and with an appropriate choice of
$\beta > 0$, our penalized ML scheme will correctly identify this minimal model.
• If a bound is known on the orders of the component and switch processes, then it will
follow from the proof in Appendix B that the first claim of Theorem 4 can be established
with any $\beta \ge 0$. However, an appropriate positive value of $\beta$ is still needed, even in this
case, to recover the canonical partition in the second claim of the theorem. As mentioned,
our deinterleaving scheme assumes that IMPs based on $F_{\Pi,\mathbf{k}}$ are fully parametrized, i.e.,
the class has $\kappa$ free statistical parameters. If the actual IMP being estimated is less than
fully parametrized (i.e., it does have some transition probabilities set to zero), the effect of
penalizing with the full $\kappa$ is equivalent to that of using a larger penalization coefficient $\beta$.
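The sign behavior of the parameter-count difference $m(m+1)^{k_w} - (m-1)m^{k_w} - 1$ discussed in the remarks is easy to confirm numerically. A minimal sketch follows; the function name is illustrative.

```python
def kappa_increase(m, kw):
    """Growth in the number of free parameters when one memoryless component
    of an m-component IMP with switch order kw is split in two."""
    return m * (m + 1) ** kw - (m - 1) * m ** kw - 1

# Vanishes for a memoryless switch, strictly positive otherwise (m > 1).
for m in range(2, 6):
    assert kappa_increase(m, 0) == 0
    assert all(kappa_increase(m, kw) > 0 for kw in range(1, 5))

print([kappa_increase(3, kw) for kw in range(4)])  # [0, 5, 29, 137]
```

For $k_w = 1$ the difference reduces to $2m - 1$, so each split of a memoryless component costs at least three extra parameters whenever the switch has memory.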
V. Experimental results

We report on experiments showing the performance of practical implementations of the
proposed deinterleaver. The experiments were based on test sets consisting of 200 interleaved
sequences each. Each sequence was generated by an IMP with $m = 3$, subalphabet sizes
$\alpha_1 = 4$, $\alpha_2 = 5$, $\alpha_3 = 6$, component Markov processes of order $k_i \le 1$ with randomly
chosen parameters, and a switch of order $k_w \le 1$ as described below. In all cases, the switches
were domination-free. Deinterleaving experiments were run on prefixes of various lengths of each
sequence, and, for each prefix length, the fraction of sequences correctly deinterleaved was
recorded.

In the first set of experiments, the component Markov processes, all of order one, were
interleaved by uniformly distributed memoryless switches (i.e., $\mathbf{k} = (1,1,1,0)$). We compared
the deinterleaving performance of the ML-based scheme proposed here with that of an
implementation of the scheme of [1], with tolerances for the latter optimized (with knowledge
of the correct partition) to obtain the best performance for each sequence length. Two variants
of the ML-based scheme were tested: Variant (a) implements (12) via exhaustive search over all
partitions.^5 Since this is rather slow, a heuristic Variant (b) was developed, based on a
randomized gradient descent-like search. This variant, which is briefly described next, is much
faster, and achieves virtually the same deinterleaving performance as the full search.

We define the neighborhood of radius $t$ of a partition $\Pi$, denoted $\mathcal{N}_t(\Pi)$, which consists
of all partitions $\Pi'$ obtained from $\Pi$ by switching up to $t$ symbols of $A$ from their original
subalphabets in $\Pi$ to other subalphabets (including possibly new subalphabets not present in $\Pi$).
The main component of the heuristic starts from an input sequence $z^n$ and a random partition $\Pi_0$
of $A$, and exhaustively searches for the partition $\Pi'$ that minimizes the cost $C_{\Pi'}(z^n)$ within
the neighborhood $\mathcal{N}_t(\Pi_0)$, for some small fixed value of $t$. The minimizing partition then
becomes the center for a new exhaustive neighborhood search. This "greedy" deterministic process
continues until no improvements in the cost function can be obtained. At this point, the best
partition $\Pi$ observed so far is perturbed by picking a random partition $\Pi'_0 \in \mathcal{N}_r(\Pi)$, for
a fixed radius $r > t$, and the deterministic search is repeated using $\Pi'_0$ in lieu of $\Pi_0$ as the
starting point. The routine stops if a given number $N$ of consecutive rounds of such perturbations
do not yield further cost reductions, at which point the best partition $\Pi$ observed so far is
returned as a candidate solution. To improve deinterleaving reliability, this basic scheme can be
run for several independent starting random partitions $\Pi_0$, noting the overall cost minimum. The
number $R$ of such outer iterations, the maximum count $N$ of consecutive perturbations without
improvement, and the neighborhood radii $t$ and $r$, are parameters controlling the complexity vs.
deinterleaving performance trade-off of the heuristic. For our experiments, we found that $R = 5$,
$N = 15$, $t = 1$, and $r = 2$ yielded performance virtually identical to a full exhaustive partition
search, with orders of magnitude reduction in complexity.^6

5 We recall that given a sequence $z^n$ and a partition $\Pi$, the order vector $\mathbf{k}$ minimizing the
cost $C_{\Pi,\mathbf{k}}(z^n)$ is determined through conventional penalized-ML order estimators for the
various sub-sequences induced by $\Pi$. We assume that this minimizing order vector is used in all
cost computations, and omit further mention of it.

6 In fact, to keep running times reasonable, the exhaustive search was given the benefit of
limiting the search space to partitions $\Pi$ with $|\Pi| \le 4$. No such limitation was assumed for the
heuristic scheme, whose search space included, in principle, partitions of any size $|\Pi| \le |A|$.

TABLE I
Fraction of correctly deinterleaved sequences (out of 200) vs. sequence length, for two variants
of the proposed scheme (ML(a) and ML(b)), and for the scheme of [1]. A penalization constant
$\beta = 1/2$ was used in all cases for the ML-based schemes.

            memoryless switch           switch with memory
            k = (1,1,1,0)               k = (1,1,1,1)       k = (0,1,1,1)
                                                            ML(b)       ML(b)
        n   ML(a)    ML(b)    [1]       ML(a)    ML(b)      canonical   compatible
      250   0.010    0.010    0.000     0.310    0.300      0.215       0.225
      500   0.135    0.130    0.000     0.635    0.620      0.600       0.625
     1000   0.440    0.420    0.000     0.915    0.915      0.880       0.900
     2500   0.820    0.815    0.000     0.995    0.995      0.990       0.990
     5000   0.960    0.960    0.005     1.000    1.000      1.000       1.000
    10000   0.990    0.990    0.030     1.000    1.000      1.000       1.000
    15000   1.000    1.000    0.080     1.000    1.000      1.000       1.000
    20000   1.000    1.000    0.135     1.000    1.000      1.000       1.000
    50000   -        1.000    0.460     -        1.000      1.000       1.000
   100000   -        1.000    0.770     -        1.000      1.000       1.000
   500000   -        1.000    0.965     -        1.000      1.000       1.000
  1000000   -        1.000    0.980     -        1.000      1.000       1.000
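The deterministic "greedy" phase of the heuristic with radius $t = 1$ can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: partitions are represented as frozensets of frozensets, the cost is supplied as an arbitrary callable (in the scheme it would be the penalized cost with the minimizing order vector), and the random perturbation and restart steps are omitted. The names `neighbors_radius1` and `greedy_descent` are hypothetical.

```python
def neighbors_radius1(partition):
    """Yield all partitions obtained by moving a single symbol to another
    subalphabet, or to a new singleton subalphabet."""
    seen = set()
    blocks = list(partition)
    for src in blocks:
        others = [b for b in blocks if b is not src]
        for a in src:
            rest = src - {a}
            # An empty target block stands for a new subalphabet.
            for dst in others + [frozenset()]:
                new_blocks = [b for b in others if b is not dst]
                if rest:
                    new_blocks.append(rest)
                new_blocks.append(dst | {a})
                cand = frozenset(new_blocks)
                if cand != partition and cand not in seen:
                    seen.add(cand)
                    yield cand

def greedy_descent(start, cost):
    """Move to the best radius-1 neighbor until the cost stops improving."""
    current, current_cost = start, cost(start)
    while True:
        best = min(neighbors_radius1(current), key=cost, default=None)
        if best is None or cost(best) >= current_cost:
            return current, current_cost
        current, current_cost = best, cost(best)
```

A full version along the lines of the text would wrap `greedy_descent` in a loop that perturbs the best partition found so far within a larger radius $r$, stopping after $N$ consecutive perturbations without improvement, and would repeat the whole procedure from $R$ random starting partitions.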
Fig. 3. Deinterleaving success rate vs. sequence length for various IMPs and deinterleavers.

The results of the experiments with memoryless switches are summarized in columns 2-4 of
Table I. The table shows that the proposed ML-based scheme (in either variant) achieves better
than 80% deinterleaving accuracy for sequences as short as $n = 2500$, with perfect deinterleaving
for $n \ge 15000$, whereas the scheme of [1], although fast, requires much longer sequences,
correctly deinterleaving just one sequence in 200 for $n = 5000$, and achieving 98% accuracy for
$n = 10^6$ (the maximum length tested in the experiments). This comparison is illustrated by the
curves labeled (1) and (2) in Figure 3.

In the second set of experiments, we used, for each sequence, the same component processes as
in the first set, but with a switch $P_w$ of order one (i.e., $\mathbf{k} = (1, 1, 1, 1)$), with random
parameters and uniform marginal subalphabet probabilities. The results are presented in columns
5-6 of Table I, and plotted in the curve labeled (3) in Figure 3. We observe that the additional
structure resulting from the switch memory allows for improved deinterleaving performance for
shorter sequences: better than 60% accuracy is obtained for sequences as short as $n = 500$, while
perfect deinterleaving is obtained for $n \ge 5000$. A comparison with the scheme of [1] is omitted
in this case, as the determination of appropriate statistics thresholds (not discussed in [1])
appears more involved than in the memoryless switch case, and is beyond the scope of this paper.

Finally, in a third set of experiments, we maintained switches of order one, but let the
component process $P_1$ in each case be memoryless (i.e., $\mathbf{k} = (0, 1, 1, 1)$). Recall that, by
Lemma 1, the resulting IMPs in this case have ambiguous representations. Results for the
heuristic ML-based scheme are presented in columns 7-8 of Table I, which list the fraction
of sequences of each length for which the deinterleaver picked the canonical partition, or any
compatible partition, respectively. We observe that, except for minor deviations for the shorter
sequence lengths, the deinterleaver consistently picks the canonical partition, as expected from
Theorem 4. The fraction of sequences for which the canonical partition is chosen is plotted in
the curve labeled (4) in Figure 3. Memoryless components are excluded in [1], so a comparison
is not possible in this case.



Recalling the second remark at the end of Section IV, we note that any nonnegative value
of the penalization constant $\beta$ would have sufficed for the ML schemes in the first two sets of
experiments, since the IMPs considered have unique representations, and the order of all the
processes tested was bounded by 1. However, a positive value of $\beta$ is required to recover the
canonical partition (and from it, all compatible partitions) in the case of the third set. For
shorter sequences, a value of $\beta$ as small as possible is preferred to exclude non-compatible
partitions, while a value of $\beta$ as large as possible is preferred to recover the canonical
partition. Overall, a value $\beta = 1/2$ worked well in practice in all cases, providing the best
trade-off for shorter sequence lengths (clearly, the choice becomes less critical as the sequence
length increases). This value of $\beta$ is smaller than the value employed in the proof of
Theorem 4. In general, the question of determining the minimal penalty that guarantees consistent
deinterleaving remains open. The situation bears some similarity to the one encountered with
Markov order estimators: while it is known that $\beta = 1/2$ guarantees strong consistency in all
cases, it is also known that much smaller penalization constants (or even penalization functions
$o(\log n)$) may suffice when the process order is bounded. The general question of the minimal
penalization that guarantees consistent unbounded order estimation is, also in this case, open [6].

Appendix A 

Uniqueness of IMP representations: derivations 

A. Derivation of Theorems 1 and 2

Theorems 1 and 2 will be established through a series of lemmas. The first one (Lemma 2
below) captures some essential properties of the interleaved process
$P = \mathcal{I}_\Pi(P_1, P_2, \ldots, P_m; P_w)$ and of the domination relation, which we will draw upon
repeatedly in the sequel. These properties follow immediately from our ergodicity and independence
assumptions. Intuitively, the key point is that if $A_1 \not\sqsupset A_2$, the interleaved system can
always take a trajectory (of positive probability) where it reaches an arbitrary state $s$ of $P_1$,
and then, without returning to $A_1$, visits any desired part of $A_2$ any desired number of times
(while the state of $P_1$ remains, of course, unchanged). The last segment of the trajectory, with
an unbounded number of occurrences of $A_2$, can be chosen independently of $s$. For ease of
reference, these observations are formally stated in the lemma below, where $N_a(z)$ denotes the
number of occurrences of a symbol $a$ in a string $z$.

Lemma 2: Consider the subalphabets $A_1, A_2 \in \Pi$, and assume $A_1 \not\sqsupset A_2$.

i) Let $M_1$ and $M$ be arbitrary integers. There exist strings $U, V \in \Pi^*$ such that
$P_w(UV) > 0$, $N_{A_1}(U) \ge M_1$, $N_{A_1}(V) = 0$, $N_{A_2}(V) \ge M$, and $P_w(A_1 \mid UV) > 0$.

ii) Let $M_2$ be an arbitrary integer, let $s$ be an arbitrary state of $P_1$, and consider an
arbitrary subset $B_2 \subseteq A_2$ and an integer $M_1 \ge k_1$. There exists an integer $M \ge M_2$, and
strings $u, v \in A^*$ such that $uv$ is consistent with $UV$ (with $|u| = |U|$), where $U$ and $V$ are
the strings obtained from Part i) for these values of $M_1$ and $M$, $u[A_1] = u's$ for some
$u' \in A_1^*$, $|v[B_2]| \ge M_2$, and the choice of $v$ does not depend on $s$ (in particular, the same
$v$ can be chosen for any $s \in \mathcal{S}(P_1)$).

Proof: Part i) follows from the ergodicity of $P_w$, the positivity of both $P_w(A_1)$ and
$P_w(A_2)$, and the definition of domination. The existence of the desired string $u$ in Part ii)
follows further from the independence of the component and switch processes, and from the
ergodicity of $P_1$ (in particular, the fact that $P_1(s) > 0$). Relying also on the ergodicity of
$P_2$, we obtain the string $v$. The value of $M$ is determined by how many times $v$ must visit $A_2$
to obtain $M_2$ occurrences of symbols in the subset $B_2$. The independence of $v$ from $s$ follows
from (1), which allows us to substitute any string over $A_1$, of positive probability, for $u[A_1]$
in $uv$, resulting in a string $\tilde{u}v$, with $P(\tilde{u}v) > 0$, $\tilde{u}$ compatible with $U$, and
$\tilde{u}[A_1]$ ending in any desired state of $P_1$. ■
For succinctness, in the series of lemmas and corollaries that follows, we assume throughout
that we are given an ambiguous IMP,
$P = \mathcal{I}_\Pi(P_1, P_2, \ldots, P_m; P_w) = \mathcal{I}_{\Pi'}(P'_1, P'_2, \ldots, P'_{m'}; P'_w)$,
where $\Pi = \{A_1, A_2, \ldots, A_m\}$ and $\Pi' = \{A'_1, A'_2, \ldots, A'_{m'}\}$ are partitions of $A$,
with $\Pi \ne \Pi'$. Clearly, for at least one alphabet $A_i$ we must have $A_i \notin \Pi'$, so we assume,
without loss of generality, that $A_1 \notin \Pi'$, and, furthermore, that $A_1 \cap A'_1 \ne \emptyset$.
Also, we say that two subalphabets $A_i, A_j \in \Pi$ share a subalphabet $A'_\ell \in \Pi'$ if $A'_\ell$
intersects both $A_i$ and $A_j$.

Lemma 3: Assume that $A_2$ shares $A'_1$ with $A_1$, and $A_1 \not\sqsupset A_2$. Then, for all
$a \in A_1 \cap A'_1$, $P_1(a \mid s)$ is independent of $s \in \mathcal{S}(P_1)$.

Proof: Let $a \in A_1 \cap A'_1$, and $s \in \mathcal{S}(P_1)$. Let $U, V \in \Pi^*$ and $u, v \in A^*$ be the
strings guaranteed by Lemma 2 for the given state $s$, $M_2 = \mathrm{ord}(P'_1)$, and
$B_2 = A_2 \cap A'_1$. Recall that $v$ can be chosen independently of $s$, and
$|v[B_2]| \ge M_2 = \mathrm{ord}(P'_1)$. Let $v' = v[A'_1]$, and let $U'V' = A_{\Pi'}(uv)$. Then, applying
(1) separately to each of the two given IMP representations of $P$, and noting that
$|v'| \ge |v[B_2]| \ge \mathrm{ord}(P'_1)$, we have

$$P(a \mid uv) = P_1(a \mid s)\, P_w(A_1 \mid UV) = P'_1(a \mid v')\, P'_w(A'_1 \mid U'V').$$

Now, recalling that $P_w(A_1 \mid UV) > 0$ by Lemma 2(i), we obtain

$$P_1(a \mid s) = \frac{P'_1(a \mid v')\, P'_w(A'_1 \mid U'V')}{P_w(A_1 \mid UV)},$$

which is independent of $s$. ■
Lemma 4: Assume that $A'_1 \subseteq A_1$, $A'_2 \cap A_1 \ne \emptyset$, and $A'_1 \not\sqsupset A'_2$. Then,
$P'_1$ is memoryless.

Proof: The lemma follows by applying Lemma 3 with the roles of $\Pi$ and $\Pi'$ reversed, and
observing that $A'_1 \cap A_1 = A'_1$. ■

Lemma 5: Assume that $A_1 \not\sqsupset A_2$ and $A'_1 \subseteq A_1$. If $A'_2 \in \Pi'$, and
$A'_2 \cap A_2 \ne \emptyset$, then $A'_1 \not\sqsupset A'_2$.

Proof: We apply Lemma 2, referring only to the strings $V$ and $v$ guaranteed by the lemma,
and with $B_2 = A_2 \cap A'_2$. Thus, for any integer $M_2$, there exists a string $V \in \Pi^*$ and a
string $v$ consistent with $V$ such that $M_2 \le |v[B_2]| \le |v[A'_2]|$, while $N_{A_1}(V) = 0$ and,
consequently, $|v[A'_1]| = 0$. Letting $V' = A_{\Pi'}(v)$, we then have $N_{A'_1}(V') = 0$ and
$N_{A'_2}(V') \ge M_2$ for arbitrarily large $M_2$. Thus, $A'_1 \not\sqsupset A'_2$. ■
Lemma 6: Assume that $A_1$ is not totally dominant, $A'_1 \subseteq A_1$, and $P'_1$ is memoryless.
Then, for all $a \in A'_1$, $P_1(a \mid s)$ is independent of $s \in \mathcal{S}(P_1)$.

Proof: Since $m > 1$ and $A_1$ is not totally dominant, there exists a subalphabet, say
$A_2 \in \Pi$, such that $A_1 \not\sqsupset A_2$. Consider a symbol $a \in A'_1$. Let $s$ be an arbitrary
state of $P_1$, and let $U$, $V$, $u$, and $v$ be the strings guaranteed by Lemma 2 for the state $s$,
with $M_2 = \max\{\mathrm{ord}(P_w), \mathrm{ord}(P'_w)\}$. Then, applying (1) to the two IMP
representations under consideration, we have

$$P(a \mid uv) = P_1(a \mid s)\, P_w(A_1 \mid UV) = P'_1(a)\, P'_w(A'_1 \mid U'V'), \qquad (14)$$

where $U'V' = A_{\Pi'}(uv)$, and we have relied on the fact that $P'_1$ is memoryless. Recall from
Lemma 2(i) that $P_w(A_1 \mid UV) > 0$. By our choice of $M_2$, it follows from (14) that
$P_1(a \mid s) = P'_1(a)\, P'_w(A'_1 \mid V') / P_w(A_1 \mid V)$, which is independent of $s$. ■

Lemma 7: Assume that $A_1$ does not dominate any subalphabet $A_j$, $j > 1$, that shares some
$A'_\ell \in \Pi'$ with $A_1$. Then, either $P_1$ is memoryless, or $A_1$ splits into subalphabets in
$\Pi'$.

Proof: Assume that $A_1$ does not split into subalphabets in $\Pi'$. Then, there exists a
subalphabet $A'_\ell \in \Pi'$ that intersects $A_1$ but is not contained in it, so $A_1$ shares
$A'_\ell$ with some $A_j$, $j > 1$. By the lemma's assumptions, we have $A_1 \not\sqsupset A_j$.
Therefore, by Lemma 3, $P_1(a \mid s)$ is independent of $s \in \mathcal{S}(P_1)$ for all
$a \in A_1 \cap A'_\ell$. Assume now that there is also a subalphabet $A'_h \in \Pi'$ such that
$A'_h \subseteq A_1$. By Lemma 5, we have $A'_h \not\sqsupset A'_\ell$, and, therefore, by Lemma 4,
$P'_h$ is memoryless. Thus, by Lemma 6, $P_1(a \mid s)$ is independent of $s$ also when
$a \in A'_h \subseteq A_1$. Consequently, if $A_1$ does not split in $\Pi'$, since every $a \in A_1$ must
belong to some $A'_h \in \Pi'$, and $P_1(a \mid s)$ is independent of $s \in \mathcal{S}(P_1)$ whether $A'_h$
is contained in $A_1$ or not, $P_1$ must be memoryless. ■

Lemma 8: Assume that $A_1$ is not totally dominant, and that $A_1$ does not dominate any
subalphabet $A_j$, $j > 1$, that shares some $A'_\ell$ with $A_1$. Then, $P_1$ is memoryless.

Proof: If $P_1$ is not memoryless, then by Lemma 7, $A_1$ splits into subalphabets in $\Pi'$.
Thus, up to re-labeling of subalphabets, we have $A_1 = A'_1 \cup A'_2 \cup \cdots \cup A'_r$, where
$A'_i \in \Pi'$, $1 \le i \le r \le m'$, with $r > 1$. Furthermore, by Lemma 6, at least one of the
$A'_i$, say $A'_1$, is not memoryless (for, otherwise, $P_1$ would be memoryless). By Lemma 4,
$A'_1$ must dominate all $A'_i$, $2 \le i \le r$, and in particular, $A'_1 \sqsupset A'_2$. It follows
from this domination relation that there exists a string $U' \in (\Pi')^*$ such that
$P'_w(A'_2 \mid U') = 0$, and $P'_w(A'_1 \mid U') > 0$. By the ergodicity of $P'_w$, we can assume without
loss of generality that the number of occurrences of subalphabets $A'_1, A'_2, \ldots, A'_r$ in $U'$
is at least $k_1 = \mathrm{ord}(P_1)$. Let $u$ be a string consistent with $U'$. We have
$|u[A_1]| \ge k_1$; let $t \in \mathcal{S}(P_1)$ be the suffix of length $k_1$ of $u[A_1]$. Consider a symbol
$b \in A'_2$, and let $U'' = A_\Pi(u)$. Applying (1) separately to the two available IMP
representations of $P$, we have

$$P(b \mid u) = P_1(b \mid t)\, P_w(A_1 \mid U'') = P'_2(b \mid u[A'_2])\, P'_w(A'_2 \mid U') = 0, \qquad (15)$$

where the last equality follows from our choice of $U'$. On the other hand, since we also have
$P'_w(A'_1 \mid U') > 0$, we must have $P(a \mid u) > 0$ for some $a \in A'_1 \subseteq A_1$, and, therefore,
$P_w(A_1 \mid U'') > 0$. Thus, it follows from (15) that $P_1(b \mid t) = 0$. By our assumptions on
component processes, there must also be a state $s \in \mathcal{S}(P_1)$ such that $P_1(b \mid s) > 0$.
Since $A_1$ is not totally dominant, there exists a subalphabet, say $A_2$, such that
$A_1 \not\sqsupset A_2$. Let $B_2 = A_2$ and $M_2 = \max\{\mathrm{ord}(P_w), \mathrm{ord}(P'_w)\}$. We apply
Lemma 2(ii), separately to the states $s$ and $t$, choosing the same string $v$ for both as allowed
by the lemma. Specifically, let $U$ and $V$ be the strings over $\Pi$ obtained from the lemma, and
let $u^{(t)}$, $u^{(s)}$, and $v$ be strings such that $u^{(t)}[A_1] = u't$, $u^{(s)}[A_1] = u''s$ for some
$u'$, $u''$, both $u^{(s)}v$ and $u^{(t)}v$ are consistent with $UV$, and $|v[A_2]| \ge M_2$. Let
$V' = A_{\Pi'}(v)$. Clearly, $|V| = |V'| \ge M_2$, so $V$ and $V'$ determine states in the respective
switches. Applying (1) again, we obtain

$$P(b \mid u^{(s)}v) = P'_2(b \mid u^{(s)}[A'_2])\, P'_w(A'_2 \mid V') = P_1(b \mid s)\, P_w(A_1 \mid V) > 0, \qquad (16)$$

where the last inequality follows from our choice of $s$, and the fact that
$P_w(A_1 \mid V) = P_w(A_1 \mid UV) > 0$ by our choice of $M_2$ and by Lemma 2(i). Thus, we must have
$P'_w(A'_2 \mid V') > 0$. On the other hand, we can also write

$$P(b \mid u^{(t)}v) = P'_2(b \mid u^{(t)}[A'_2])\, P'_w(A'_2 \mid V') = P_1(b \mid t)\, P_w(A_1 \mid V) = 0, \qquad (17)$$

where the last equality follows from our choice of $t$. Since, as previously claimed,
$P'_w(A'_2 \mid V') > 0$, it follows from (17) that $P'_2(b \mid u^{(t)}[A'_2]) = 0$, which must hold for
all $b \in A'_2$, a contradiction, since every state of $P'_2$ must have at least one symbol with
positive probability (the argument holds even if $|u^{(t)}[A'_2]| < \mathrm{ord}(P'_2)$, reasoning with
marginal probabilities). We conclude that $P_1$ must be memoryless. ■
The following corollary is an immediate consequence of Lemma 8.

Corollary 2: Assume that $A_1$ is not dominant. Then, $P_1$ is memoryless.
Assume now that $P_w$ is such that no two alphabets in $\Pi$ are in mutual domination. As discussed
in Section III-A, this ensures that $\sqsupset$ defines a strict partial order on $\Pi$. We classify
alphabets in $\Pi$ into disjoint layers $L_i$, $i \ge 0$, as follows: Given $L_0, L_1, \ldots, L_{i-1}$,
and assuming that these layers do not exhaust $\Pi$, we let $L_i$ consist of the alphabets that have
not been previously assigned to layers, and that only dominate alphabets contained in layers
$L_{i'}$, $0 \le i' < i$ (e.g., $L_0$ consists of the non-dominant alphabets in $\Pi$). Since $\Pi$ is
finite, and every finite set endowed with a strict partial order has minima, $L_i$ is well defined
and non-empty. Thus, for some $r \ge 0$, we can write

$$\Pi = L_0 \cup L_1 \cup \cdots \cup L_r, \qquad (18)$$

where the layers $L_0, L_1, \ldots, L_r$ are all disjoint and non-empty.^7
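The layer decomposition can be computed directly from the domination relation. The following minimal sketch assumes the relation is supplied as a set of ordered pairs `(X, Y)` meaning "X dominates Y"; the function name and this input format are illustrative.

```python
def domination_layers(alphabets, dominates):
    """Split `alphabets` into layers L_0, L_1, ...: each layer holds the
    unassigned alphabets that dominate only alphabets in earlier layers."""
    remaining = set(alphabets)
    assigned = set()
    layers = []
    while remaining:
        layer = {x for x in remaining
                 if all(y in assigned for (d, y) in dominates if d == x)}
        if not layer:
            # Happens only if some alphabets are in mutual domination.
            raise ValueError("relation is not a strict partial order")
        layers.append(layer)
        assigned |= layer
        remaining -= layer
    return layers

# A1 dominates A2 and A3, A2 dominates A3, A4 is unrelated to the others.
layers = domination_layers(
    ["A1", "A2", "A3", "A4"],
    {("A1", "A2"), ("A1", "A3"), ("A2", "A3")},
)
```

In this example the non-dominant alphabets `A3` and `A4` form the first layer, followed by `A2` and then `A1`, so an alphabet that dominates every other one would necessarily land in the last layer.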

We are now ready to present the proofs of Theorems 1 and 2, which rely on the foregoing
lemmas and corollaries, and on the classification of alphabets into layers $L_i$.

Proof of Theorem 1: For the layers in (18) we prove, by induction on $i$, that $L_i \subseteq \Pi'$
for $0 \le i \le r$. By the definition of $L_0$, alphabets $A_j \in L_0$ are not dominant. Thus, by
Corollary 2, we must have $A_j \in \Pi'$, since, by assumption (iii), $A_j$ is not memoryless. Hence,
$L_0 \subseteq \Pi'$. Assume now that the induction claim has been proven for
$L_0, L_1, \ldots, L_{i-1}$, $1 \le i \le r$. Let $A_j$ be any alphabet in $L_i$. By definition of $L_i$,
$A_j$ only dominates alphabets in layers $L_{i'}$, $i' < i$. But, by our induction hypothesis,
alphabets in these layers are elements of $\Pi'$, and, thus, they do not share with other alphabets
from $\Pi$. Thus, $A_j$ does not dominate any alphabet $A_h$ with which it shares any $A'_\ell$. By
Lemma 8, we must have $A_j \in \Pi'$, since $A_j$ is neither totally dominant nor memoryless by the
assumptions of the theorem. Hence, $L_i \subseteq \Pi'$, and our claim is proven. Now, it follows from
(18) that $\Pi \subseteq \Pi'$, and, since both $\Pi$ and $\Pi'$ are partitions of the same alphabet $A$, we
must have $\Pi = \Pi'$. ■

Proof of Theorem 2: Examining the proof of Theorem 1, we observe that when Condition (ii) is
removed, any totally dominant alphabet must reside in $L_r$, the last layer in (18). Furthermore,
if there is such an alphabet $A_{i_0}$, it must be unique, for otherwise there would be alphabets in
mutual domination. Thus, we have $L_r = \{A_{i_0}\}$, and $A_j \in \Pi'$ for all $j \ne i_0$, and,
therefore, $A_{i_0}$ splits into the remaining alphabets in $\Pi'$ that are not equal to any $A_j$. ■

7 The layers $L_i$ correspond to height levels in the directed acyclic graph associated with the
transitive reduction of the partial order $\sqsupset$.

B. Derivation of Theorem 3

We start by proving Lemma 1 of Subsection III-C, and then proceed to present an additional
auxiliary lemma, and the proof of Theorem 3.

Proof of Lemma 1: Assume $P_1^{(j)}$, $j \in \{1, 2\}$, and $P'_w$ satisfy (6)-(8). We prove that
$P(u^n) = P'(u^n)$ for all lengths $n$ and sequences $u^n \in A^n$ by induction on $n$. For $n = 0$, the
claim is trivially true due to the convention $P(\lambda) = P'(\lambda) = 1$. Assume that
$P(u^{n-1}) = P'(u^{n-1})$ for $n > 0$ and all $u^{n-1} \in A^{n-1}$, and consider a sequence
$u^n = u^{n-1}u_n$. Let $U^n = A_\Pi(u^n)$ and $(U')^n = A_{\Pi'}(u^n)$, and let $S \in \mathcal{S}(P_w)$ and
$S' \in \mathcal{S}(P'_w)$ be the states selected by $U^{n-1}$ and $(U')^{n-1}$, respectively. Clearly, we
have $S = \Psi(S')$. By the definition of $\Pi'$, if $U_n = A_i$, $i \in \{2, 3, \ldots, m\}$, then
$U'_n = U_n$, and we have

$$P'(u^n) = P'(u^{n-1})\, P'(u_n \mid u^{n-1}) = P'(u^{n-1})\, P'_w(A_i \mid S')\, P_i(u_n \mid u^{n-1}[A_i])$$
$$= P(u^{n-1})\, P_w(A_i \mid S)\, P_i(u_n \mid u^{n-1}[A_i]) = P(u^n), \qquad (19)$$

where the second and last equalities follow from the definitions of the respective IMPs, and the
third equality follows from the induction hypothesis and (8). On the other hand, if $U_n = A_1$,
then $U'_n = B_j$ for some $j \in \{1, 2\}$, and we have

$$P'(u^n) = P'(u^{n-1})\, P'(u_n \mid u^{n-1}) = P'(u^{n-1})\, P'_w(B_j \mid S')\, P_1^{(j)}(u_n)$$
$$= P(u^{n-1})\, P_w(A_1 \mid S)\, P_1(B_j)\, \frac{P_1(u_n)}{P_1(B_j)} = P(u^n), \qquad (20)$$

where, this time, the third equality follows from the induction hypothesis, (8), and (6) (we recall
that $P_1$, $P_1^{(1)}$, and $P_1^{(2)}$ are memoryless). This completes the induction proof and
establishes that $P' = P$.

To prove the "only if" part of the lemma, we assume that $P' = P$, and consider a sufficiently
long, arbitrary string $u^n$ such that $P(u^n) > 0$. Let $U' = A_{\Pi'}(u^{n-1})$, and assume first that
$u_n \in A_i$ for some $i \ge 2$. Then, similarly to (19) (but proceeding from the inside out), and
noting that $A_\Pi(u^{n-1}) = \Psi(U')$, we can write

$$P'(u^{n-1})\, P'_w(A_i \mid U')\, P_i(u_n \mid u^{n-1}[A_i]) = P'(u^n) = P(u^n)$$
$$= P(u^{n-1})\, P_w(A_i \mid \Psi(U'))\, P_i(u_n \mid u^{n-1}[A_i]). \qquad (21)$$

Since $P' = P$, and $P(u^n) > 0$, (21) can be simplified to

$$P'_w(A_i \mid U') = P_w(A_i \mid \Psi(U')), \quad i \in \{2, 3, \ldots, m\}, \qquad (22)$$

for arbitrary $U' \in (\Pi')^{n-1}$ of positive probability. Consider now the case $u_n = b \in B_j$,
$j \in \{1, 2\}$. Then, in analogy with (20), we write

$$P'(u^{n-1})\, P'_w(B_j \mid U')\, P_1^{(j)}(b) = P'(u^n) = P(u^n) = P(u^{n-1})\, P_w(A_1 \mid \Psi(U'))\, P_1(b). \qquad (23)$$

Adding over all $b \in B_j$ and simplifying, we obtain

$$P'_w(B_j \mid U') = P_w(A_1 \mid \Psi(U'))\, P_1(B_j), \quad j \in \{1, 2\}, \qquad (24)$$

again for arbitrary $U'$. Conditions (7)-(8) now follow readily from (22) and (24) (which imply,
in particular, that $k_w = k'_w$), and Condition (6) follows by substituting the right-hand side of
(24) for $P'_w(B_j \mid U')$ in (23) and solving for $P_1^{(j)}(b)$. ■

We say that the representations $\mathcal{I}_\Pi$ and $\mathcal{I}_{\Pi'}$ of an IMP $P$ coincide up to
memoryless components if the set of component processes of positive order is the same in both
representations. The following lemma establishes the uniqueness of canonical partitions.

Lemma 9: Let $\mathcal{I}_\Pi$ and $\mathcal{I}_{\Pi'}$ be IMP representations of a process $P$ that coincide
up to memoryless components, and such that both are canonical. Then, $\Pi = \Pi'$.

Proof: Assume that $\Pi \ne \Pi'$, and let $\Pi''$ be the smallest common refinement of $\Pi$ and $\Pi'$
(i.e., $\Pi'' = \{A_i \cap A'_j \mid A_i \in \Pi,\ A'_j \in \Pi',\ A_i \cap A'_j \ne \emptyset\}$). By repeated application of Lemma 1,
there exists an IMP representation $\mathcal{I}_{\Pi''}(P''_1, P''_2, \ldots, P''_{m''}; P''_w)$ of $P$. This representation is a
memoryless refinement of both $\mathcal{I}_{\Pi}$ and $\mathcal{I}_{\Pi'}$. Since $\Pi \ne \Pi'$, there exists an alphabet, say $A'_1 \in \Pi'$,
such that $A'_1 \notin \Pi$, $P'_1$ is memoryless, and we can assume without loss of generality that $A'_1$
intersects at least two alphabets, $A_1$ and $A_2$, in $\Pi$ (otherwise, we can switch the roles of $\Pi$ and
$\Pi'$). Let $B_1 = A'_1 \cap A_1$ and $B_2 = A'_1 \cap A_2$, so that $B_1, B_2 \in \Pi''$. Applying Lemma 1 separately
to $\mathcal{I}_{\Pi}$ and to $\mathcal{I}_{\Pi'}$ with respect to the refinement $\mathcal{I}_{\Pi''}$, we can write, for any $S'' \in S(P''_w)$, and
denoting $S = \Psi_{\Pi,\Pi''}(S'')$ and $S' = \Psi_{\Pi',\Pi''}(S'')$,

$$P''_w(B_1 \mid S'') = P_w(A_1 \mid S)\,P_1(B_1) = P'_w(A'_1 \mid S')\,P'_1(B_1),$$

where $P_1(B_1)$ and $P'_1(B_1)$ are nonzero. (Notice that the equation holds also when $B_1 = A_1$, i.e.,
when $A_1$ is not actually refined in $\Pi''$.) Therefore, we can write

$$P_w(A_1 \mid S) = \frac{P'_w(A'_1 \mid S')\,P'_1(B_1)}{P_1(B_1)}. \qquad (25)$$

Using a similar argument for $B_2$ and $A_2$, we obtain

$$P_w(A_2 \mid S) = \frac{P'_w(A'_1 \mid S')\,P'_1(B_2)}{P_2(B_2)}. \qquad (26)$$

It follows from (25) and (26) that if $P'_w(A'_1 \mid S') = 0$, then $P_w(A_1 \mid S) = P_w(A_2 \mid S) = 0$, and,
otherwise,

$$\frac{P_w(A_2 \mid S)}{P_w(A_1 \mid S)} = \frac{P_1(B_1)\,P'_1(B_2)}{P'_1(B_1)\,P_2(B_2)} \triangleq \gamma,$$

where $\gamma > 0$ is independent of $S''$ (and of $S$). Observing that $S$ can assume any value in $S(P_w)$,
we conclude, by Lemma 1 and the remarks following its statement, that $A_1$ could be merged
with $A_2$, contradicting the assumption that $\mathcal{I}_{\Pi}$ is canonical. Thus, we must have $\Pi = \Pi'$. ■
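The constant-ratio condition reached at the end of the proof can be tested mechanically on a given switch. The sketch below is only an illustration under assumed names and values: `ratio_if_constant` and the toy first-order switch are hypothetical, and the function checks only the ratio condition, not the remaining requirements for merging two subalphabets (memoryless components and coinciding conditional distributions on states identified by the state mapping).

```python
from math import isclose

def ratio_if_constant(switch, a1, a2):
    """Return gamma if P_w(a2|S) = gamma * P_w(a1|S) for every state S, else None.

    `switch` maps each state S to a dict of conditional probabilities P_w(.|S).
    States where both probabilities vanish impose no constraint.
    """
    gamma = None
    for dist in switch.values():
        p1, p2 = dist[a1], dist[a2]
        if p1 == 0 and p2 == 0:
            continue
        if p1 == 0:
            return None  # a2 possible but a1 impossible: no finite ratio
        r = p2 / p1
        if gamma is None:
            gamma = r
        elif not isclose(r, gamma, rel_tol=1e-9):
            return None
    return gamma

# Hypothetical first-order switch over {A1, A2, A3}; the state is the last subalphabet.
mergeable = {
    "A1": {"A1": 0.4, "A2": 0.2, "A3": 0.4},
    "A2": {"A1": 0.4, "A2": 0.2, "A3": 0.4},
    "A3": {"A1": 0.6, "A2": 0.3, "A3": 0.1},
}
not_mergeable = {
    "A1": {"A1": 0.4, "A2": 0.2, "A3": 0.4},
    "A2": {"A1": 0.4, "A2": 0.2, "A3": 0.4},
    "A3": {"A1": 0.3, "A2": 0.6, "A3": 0.1},
}
gamma_ok = ratio_if_constant(mergeable, "A1", "A2")
gamma_bad = ratio_if_constant(not_mergeable, "A1", "A2")
```

In the first toy switch the ratio is the same in every state, which is the situation the proof rules out for a canonical representation; in the second it is not.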
Proof of Theorem 3: Assume $P = P'$. Since there are no dominant alphabets in
either representation, it follows from Corollary 2 that the representations must coincide up
to memoryless components. It then follows from Lemma 9 that the canonical partitions of
$\mathcal{I}_{\Pi}$ and $\mathcal{I}_{\Pi'}$ must be identical, and, thus, since they generate the same process, we must have
$(\mathcal{I}_{\Pi})^* = (\mathcal{I}_{\Pi'})^*$. The "if" part is straightforward, since $(\mathcal{I}_{\Pi})^*$ generates $P$, and $(\mathcal{I}_{\Pi'})^*$ generates
$P'$. ■

Appendix B 
The deinterleaving scheme: derivations 



We will prove Theorem 4 through the auxiliary Lemmas 10 and 11 below, for which we need
some additional definitions.

Let $F = (S, s_0, f)$ be an FSM, and let $P$ and $Q$ be processes generated by $F$, such that $P$ is
ergodic. The divergence (relative to $F$) between $P$ and $Q$ is defined as

$$D(P\|Q) = \sum_{s \in S} P(s)\, D\big(P(\cdot|s)\,\|\,Q(\cdot|s)\big), \qquad (27)$$

where $P(s)$ denotes the stationary probability of the state $s \in S$, and $D(P(\cdot|s)\,\|\,Q(\cdot|s))$ denotes
the Kullback-Leibler divergence between the conditional distributions $P(\cdot|s)$ and $Q(\cdot|s)$. It is
well known (see, e.g., [17]) that $D(P\|Q)$ as defined in (27) is equal to the asymptotic normalized
Kullback-Leibler divergence between the processes $P$ and $Q$, namely,

$$D(P\|Q) = \lim_{n \to \infty} \frac{1}{n} \sum_{z^n} P(z^n) \log \frac{P(z^n)}{Q(z^n)}.$$
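To make the definition in (27) concrete, the following minimal sketch evaluates it for a first-order binary Markov chain, where the FSM state is simply the last symbol, and compares it with the normalized Kullback-Leibler divergence between length-$n$ sequence distributions computed by brute force. The transition matrices, the fixed initial state, and the function names are illustrative assumptions; logarithms are taken to base 2.

```python
from itertools import product
from math import log2

def stationary(T, iters=1000):
    """Stationary distribution of an ergodic row-stochastic matrix, by power iteration."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[s] * T[s][t] for s in range(n)) for t in range(n)]
    return pi

def fsm_divergence(P, Q):
    """Right-hand side of (27), in bits, for a first-order chain (state = last symbol)."""
    pi = stationary(P)
    return sum(
        pi[s] * P[s][x] * log2(P[s][x] / Q[s][x])
        for s in range(len(P)) for x in range(len(P[s])) if P[s][x] > 0
    )

def seq_prob(T, z, s0=0):
    """Probability of the sequence z under chain T started in state s0."""
    p, s = 1.0, s0
    for x in z:
        p *= T[s][x]
        s = x
    return p

def normalized_kl(P, Q, n, s0=0):
    """(1/n) * sum over z^n of P(z^n) log2(P(z^n)/Q(z^n)), by enumeration."""
    total = 0.0
    for z in product(range(len(P)), repeat=n):
        pz = seq_prob(P, z, s0)
        if pz > 0:
            total += pz * log2(pz / seq_prob(Q, z, s0))
    return total / n

# Assumed toy chains: P has memory, Q is i.i.d. uniform.
P = [[0.9, 0.1], [0.5, 0.5]]
Q = [[0.5, 0.5], [0.5, 0.5]]
d_state = fsm_divergence(P, Q)
d_seq = normalized_kl(P, Q, n=10)
print(d_state, d_seq)
```

For this toy pair the two quantities differ by a term of order $1/n$ caused by the fixed initial state, which is consistent with the limit statement above.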

Let $V(F_{\Pi,\mathbf{k}})$ denote the set of parameter vectors corresponding to ergodic unconstrained
FSMSs based on $F_{\Pi,\mathbf{k}}$, and let $\overline{V}(F_{\Pi,\mathbf{k}})$ denote its topological closure. Assuming full
parametrization, this set is a convex polytope in $K$-dimensional Euclidean space. The boundary of
$\overline{V}(F_{\Pi,\mathbf{k}})$ consists of parameter vectors with certain transition probabilities set to zero or one. Some
of these vectors do not correspond to ergodic FSMSs, namely, those that make some of the
marginal probabilities of states in $S$ vanish (e.g., parameter vectors where the probabilities of
all the transitions leading to a state vanish). Let $V_I(F_{\Pi,\mathbf{k}})$, in turn, denote the set of parameter
vectors of IMP-constrained FSMSs based on $F_{\Pi,\mathbf{k}}$, and $\overline{V}_I(F_{\Pi,\mathbf{k}})$ its topological closure. The set
$\overline{V}_I(F_{\Pi,\mathbf{k}})$ is a closed $\kappa$-dimensional hypersurface within $\overline{V}(F_{\Pi,\mathbf{k}})$, determined by the parameter
relations implicit in (3). As before, boundary points in $\overline{V}_I(F_{\Pi,\mathbf{k}})$ are either in $V_I(F_{\Pi,\mathbf{k}})$, or do
not correspond to valid IMPs. We shall make use of these relations in the sequel.

The following lemma will be useful in proving the first claim of Theorem 4.
Lemma 10: Let $P = \mathcal{I}_{\Pi}(P_1, P_2, \ldots, P_m; P_w)$, and let $\mathbf{k} = (k_1, k_2, \ldots, k_m, k_w)$ be the
corresponding order vector. Let $\Pi'$ be a partition of $A$ incompatible with $P$, and $\mathbf{k}'$ an arbitrary
order vector of dimension $|\Pi'| + 1$. Then, for a sample $z^n$ from $P$, and for any $\beta > 0$, we have

$$C_{\Pi',\mathbf{k}'}(z^n) > C_{\Pi,\mathbf{k}}(z^n) \quad \text{a.s. as } n \to \infty.$$
Proof: Let $F^+$ be a common refinement⁸ of $F = F_{\Pi,\mathbf{k}}$ and $F' = F_{\Pi',\mathbf{k}'}$. Let $V = V(F^+)$
denote the space of all valid parameter vectors for FSM sources based on $F^+$, and let $\overline{V}(F^+)$
denote its topological closure. The constraints satisfied by IMP sources based on $F$ and $F'$ are
extended to their representations in $V$ (notice that a refinement increases the dimension of the
parameter vector by "cloning" parameters, together with their constraints). Thus, as mentioned
in the discussion immediately preceding the lemma, the set of all IMP-constrained FSMSs based
on $F'$ maps to a lower-dimensional hypersurface $V'_I = V'_I(F^+) \subset V$, with closure $\overline{V}'_I$. We claim
that the representation of $P$ in $V$ is outside the closed hypersurface $\overline{V}'_I$, and, thus, at positive
Euclidean (or $L_1$) distance from it. To prove the claim, we first notice that since $\Pi'$ is, by
assumption, incompatible with $P$, no valid IMP-constrained assignment of parameters for $F'$
can generate $P$, and, thus, $P \notin V'_I$. Furthermore, since points in $\overline{V}'_I \setminus V'_I$ correspond to "invalid"
IMPs with unreachable states, we must have $P \notin \overline{V}'_I$, and, therefore, $P$ is at positive distance
from $\overline{V}'_I$, as claimed. The ergodicity of $P$ also implies that, in its representation in $V$, all the
states of $F^+$ have positive stationary probabilities. Applying Pinsker's inequality on a state by
state basis in (27) for $F^+$, we conclude that for any process $P' \in \overline{V}'_I$, we have

$$D(P\|P') \ge \Delta, \qquad (28)$$

for some constant $\Delta > 0$. Now, recall that $\hat{P}_{F^+}(z^n)$ denotes the ML probability of $z^n$ with
respect to $F^+$ for an unconstrained FSMS. It follows from the definition of $\hat{P}_{F^+}(z^n)$ and of the
divergence in (27) that for any process $Q$ generated by $F^+$, we have

$$-\log Q(z^n) = -\log \hat{P}_{F^+}(z^n) + n\,D(\hat{P}_{F^+}\|Q). \qquad (29)$$

8 It is always possible to construct a common refinement of two FSMs, e.g., one whose state set is the Cartesian product of 
the state sets of the refined FSMs. 






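In particular, since $F^+$ can generate any process that either $F$ or $F'$ can generate, it can assign
to $z^n$ its IMP-constrained ML probabilities with respect to $F$ and $F'$ which are, respectively,
$\hat{P}_{\Pi,\mathbf{k}}(z^n) = 2^{-\hat{H}_{\Pi,\mathbf{k}}(z^n)}$ and $\hat{P}_{\Pi',\mathbf{k}'}(z^n) = 2^{-\hat{H}_{\Pi',\mathbf{k}'}(z^n)}$. Applying (29) to $Q = \hat{P}_{\Pi,\mathbf{k}}$ and $Q = \hat{P}_{\Pi',\mathbf{k}'}$
separately, subtracting on each side of the resulting equations, and dividing by $n$, we obtain

$$\frac{1}{n}\left(\hat{H}_{\Pi',\mathbf{k}'}(z^n) - \hat{H}_{\Pi,\mathbf{k}}(z^n)\right) = D(\hat{P}_{F^+}\|\hat{P}_{\Pi',\mathbf{k}'}) - D(\hat{P}_{F^+}\|\hat{P}_{\Pi,\mathbf{k}}). \qquad (30)$$

Now, since $z^n$ is a sample from $P$, the empirical measures $\hat{P}_{F^+}$ and $\hat{P}_{\Pi,\mathbf{k}}$ tend to the true
process $P$ almost surely in the divergence sense, i.e., $D(\hat{P}_{F^+}\|P) \to 0$ and $D(\hat{P}_{\Pi,\mathbf{k}}\|P) \to 0$
a.s. as $n \to \infty$. Also, an empirical conditional probability value in either $\hat{P}_{F^+}$ or $\hat{P}_{\Pi,\mathbf{k}}$ is surely
zero if the corresponding parameter in $P$ is zero, and almost surely bounded away from zero
otherwise. Hence, we also have $D(\hat{P}_{F^+}\|\hat{P}_{\Pi,\mathbf{k}}) \to 0$ a.s. as $n \to \infty$. On the other hand, since
$\hat{P}_{\Pi',\mathbf{k}'} \in \overline{V}'_I$, (28) applies with $P' = \hat{P}_{\Pi',\mathbf{k}'}$, so we have $D(P\|\hat{P}_{\Pi',\mathbf{k}'}) \ge \Delta > 0$, and, using
a similar convergence argument, $D(\hat{P}_{F^+}\|\hat{P}_{\Pi',\mathbf{k}'}) \ge \Delta > 0$ a.s. as $n \to \infty$. Thus, it follows
from (30) that

$$\frac{1}{n}\left(\hat{H}_{\Pi',\mathbf{k}'}(z^n) - \hat{H}_{\Pi,\mathbf{k}}(z^n)\right) \ge \Delta > 0 \quad \text{a.s. as } n \to \infty,$$

which implies, by (11),

$$\frac{1}{n}\left(C_{\Pi',\mathbf{k}'}(z^n) - C_{\Pi,\mathbf{k}}(z^n)\right) \ge \Delta > 0 \quad \text{a.s. as } n \to \infty, \qquad (31)$$

since the contribution of the $O(\log n)$ penalty terms to the costs vanishes asymptotically in this
case, for any choice of $\beta > 0$. ■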
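The decomposition in (29), which drives the proof above, can be checked numerically on a small example. The sketch below is a minimal illustration for a first-order Markov FSM (state = last symbol) with a fixed initial state; it assumes that the state weights in the divergence are the empirical state frequencies of the sample, which is what makes the identity exact for a finite sequence. The sequence, the matrix `Q`, and the function name are hypothetical.

```python
from collections import Counter
from math import log2

def both_sides_of_29(z, Q, s0=0):
    """Return (-log2 Q(z), -log2 P_ML(z) + n * D(P_ML || Q)) for a first-order chain."""
    n = len(z)
    pair, visits = Counter(), Counter()
    s = s0
    for x in z:
        pair[(s, x)] += 1      # count of symbol x emitted from state s
        visits[s] += 1         # count of visits to state s
        s = x                  # next state is the symbol just emitted
    neg_log_q = -sum(c * log2(Q[s][x]) for (s, x), c in pair.items())
    # ML conditional probabilities are the empirical ratios c / visits[s].
    neg_log_ml = -sum(c * log2(c / visits[s]) for (s, x), c in pair.items())
    # Divergence as in (27), with empirical state frequencies as state weights.
    divergence = sum(
        (visits[s] / n) * (c / visits[s]) * log2((c / visits[s]) / Q[s][x])
        for (s, x), c in pair.items()
    )
    return neg_log_q, neg_log_ml + n * divergence

z = [0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0]   # assumed sample
Q = [[0.7, 0.3], [0.4, 0.6]]               # assumed candidate process
lhs, rhs = both_sides_of_29(z, Q)
print(lhs, rhs)
```

Both sides agree up to rounding, since the code length under $Q$ exceeds the ML code length by exactly $n$ times the empirical divergence.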
The following lemma, in turn, will be useful in establishing the second claim of Theorem 4.

Lemma 11: Let $\Pi$, $\Pi'$, $\mathcal{I}_{\Pi}$ and $\mathcal{I}_{\Pi'}$ be as defined in Lemma 1, so that $\mathcal{I}_{\Pi'}$ is a memoryless
refinement of $\mathcal{I}_{\Pi}$. Let $\mathbf{k} = (0, k_2, \ldots, k_m, k_w)$ be the order vector corresponding to $\mathcal{I}_{\Pi}$, and
$\mathbf{k}' = (0, 0, k_2, \ldots, k_m, k_w)$ that of $\mathcal{I}_{\Pi'}$. For a sample $z^n$ from $P$, and an appropriate choice of $\beta$,
we have: if $k_w > 0$, then

$$C_{\Pi',\mathbf{k}'}(z^n) > C_{\Pi,\mathbf{k}}(z^n) \quad \text{a.s. as } n \to \infty, \qquad (32)$$

while if $k_w = 0$, then

$$C_{\Pi',\mathbf{k}'}(z^n) = C_{\Pi,\mathbf{k}}(z^n). \qquad (33)$$

Proof: We first notice that, by Lemma 1, $V_I(F_{\Pi,\mathbf{k}})$ can alternatively be characterized as
the subset of $V_I(F_{\Pi',\mathbf{k}'})$ formed by distributions such that the switch process $P'_w$ satisfies the
following two constraints, where $\Psi$ denotes the mapping defined prior to Lemma 1:

a) If $S', S'' \in S(P'_w)$ satisfy $\Psi(S') = \Psi(S'')$ then the corresponding conditional distributions
coincide;

b) For every $S \in S(P'_w)$, $P'_w(B_2|S) = \gamma P'_w(B_1|S)$ for some parameter $\gamma$, independent of $S$.

Clearly, the dimension of both parametrizations remains $\kappa(\Pi, \mathbf{k})$. It then follows from the
definition of empirical entropy of an IMP and from (10) that

$$\hat{H}_{\Pi,\mathbf{k}}(z^n) = \hat{H}_0(z^n[B_1]) + \hat{H}_0(z^n[B_2]) + \sum_{i=2}^{m} \hat{H}_{k_i}(z^n[A_i]) - \log \hat{P}'_w(A_{\Pi'}(z^n)) \qquad (34)$$

where $\hat{P}'_w(A_{\Pi'}(z^n))$ denotes the ML probability, subject to the above two constraints, of the
switch sequence $A_{\Pi'}(z^n)$. Therefore,

$$\hat{H}_{\Pi,\mathbf{k}}(z^n) - \hat{H}_{\Pi',\mathbf{k}'}(z^n) = -\log \hat{P}'_w(A_{\Pi'}(z^n)) - \hat{H}_{k_w}(A_{\Pi'}(z^n)) \qquad (35)$$

which depends on $z^n$ only through $A_{\Pi'}(z^n)$. The above difference is obviously nonnegative, since
$\Pi'$ is a refinement of $\Pi$; equivalently, looking at the right-hand side of (35), the maximization
leading to $\hat{P}'_w(A_{\Pi'}(z^n))$ involves more constraints than the one leading to $\hat{H}_{k_w}(A_{\Pi'}(z^n))$.
Recalling the difference in model sizes computed in (13), we obtain, together with (35), that

$$\begin{aligned}
C_{\Pi',\mathbf{k}'}(z^n) - C_{\Pi,\mathbf{k}}(z^n) = {} & \hat{H}_{k_w}(A_{\Pi'}(z^n)) + \beta\, m(m+1)^{k_w} \log(n+1) \\
& - \left[ -\log \hat{P}'_w(A_{\Pi'}(z^n)) + \beta\big((m-1)m^{k_w} + 1\big) \log(n+1) \right]. \qquad (36)
\end{aligned}$$

Thus, the left-hand side of (36) is equal to the difference between penalized ML probabilities
for a switch sequence of length $n$ on $\Pi'$, for two candidate models. The first model is Markov of
order $k_w$, whereas the second model differs from the plain Markov one in that states of $(\Pi')^{k_w}$
have merged according to the mapping $\Psi$, so that the number of states is now $m^{k_w}$ (constraint (a)
above), and imposes the additional constraint (b) on the conditional probabilities of $B_1$ and $B_2$
(notice that the number of free parameters in this model is indeed $(m-1)m^{k_w} + 1$). Since, by
our assumptions, the number of states of the underlying switch process is $m^{k_w}$ and the process
does satisfy the additional constraint (b), the left-hand side of (36) can be viewed as a penalized
ML test of two models, the minimal, "true" one, and a refinement of it. When $k_w = 0$, the
refinement is trivial and the penalty difference is 0, implying (33). When $k_w > 0$, our analysis,
presented next, will rely on tools developed in [11] to study refinements of the type given by
constraint (a), which will be extended here to deal also with the type of refinement given by
constraint (b). As in [11], we will show the strong consistency of the penalized ML test for
suitable $\beta$.

Specifically, given a sequence $Z^n$ over $(\Pi')^n$, we start by defining the following "semi-ML"
Markov probability distribution $\tilde{P}'_w$ of order $k_w$: For every $S \in (\Pi')^{k_w}$ and $i = 2, \ldots, m$, we
define $\tilde{P}'_w(A_i|S) = P_w(A_i|S)$ if $S \in \Pi^{k_w}$ (i.e., $S$ is a $k_w$-tuple over $(\Pi')^{k_w}$ not containing either
$B_1$ or $B_2$, and is therefore an unrefined state of $\Pi^{k_w}$), and $\tilde{P}'_w(A_i|S) = \hat{P}_w(A_i|\Psi(S))$ otherwise,
where $\hat{P}_w(A_i|S)$ denotes the ratio between the number of occurrences of $A_i$ following a state $S$
in $Z^n$, and the number of occurrences of $S$, where $S$ can be either in $\Pi^{k_w}$ (as is $\Psi(S)$ in this
case) or, more generally, in $(\Pi')^{k_w}$. The distribution is completely determined by further setting,
for every $S \in (\Pi')^{k_w}$, the relation $\tilde{P}'_w(B_2|S) = \hat{\gamma}\, \tilde{P}'_w(B_1|S)$, where

$$\hat{\gamma} \triangleq \frac{N_{B_2}(Z^n)}{N_{B_1}(Z^n)}$$

is the ML estimate of $\gamma$ based on $Z^n$, given by the ratio between the number of occurrences
of $B_2$ and $B_1$ in $Z^n$ (independent of $S$), provided $N_{B_1}(Z^n) > 0$. Otherwise, if $N_{B_1}(Z^n) = 0$,
we let $\tilde{P}'_w(B_1|S) = 0$. Notice that $\hat{P}_w(A_i|S)$ is the ML estimate of $P'_w(A_i|S)$ regardless of
the constraint relating $P'_w(B_2|S)$ and $P'_w(B_1|S)$. Since, in order to obtain the (constrained) ML
probability $\hat{P}'_w(Z^n)$, one can first maximize over $\gamma$ and then perform independent maximizations
of the conditional probabilities for each state, it is easy to see that, for any $Z^n \in (\Pi')^n$, we have

$$P'_w(Z^n) \le \tilde{P}'_w(Z^n) \le \hat{P}'_w(Z^n) \qquad (37)$$

justifying our reference to $\tilde{P}'_w$ as a "semi-ML" Markov probability distribution.

Another (non-constrained) "semi-ML" Markov probability distribution $\bar{P}'_w$ of order $k_w$ is
defined as follows: For every $S \in (\Pi')^{k_w} \cap \Pi^{k_w}$ we define $\bar{P}'_w(A_i|S) = P_w(A_i|S)$, $i = 2, \ldots, m$,
and $\bar{P}'_w(B_2|S) = \hat{\gamma}_S\, \bar{P}'_w(B_1|S)$, where $\hat{\gamma}_S$ denotes the ratio between the number of occurrences
of $B_2$ and $B_1$ following state $S$ in $Z^n$, provided the latter number is positive (otherwise,
we let $\bar{P}'_w(B_1|S) = 0$). For all other states $S \in (\Pi')^{k_w}$ and every $Z \in \Pi'$, we define
$\bar{P}'_w(Z|S) = \hat{P}_w(Z|S)$.

Notice that for states in $(\Pi')^{k_w} \cap \Pi^{k_w}$, $\bar{P}'_w$ differs from $\tilde{P}'_w$ in that the ratio between the
conditional probabilities of $B_2$ and $B_1$ depends on $S$ (while the conditional probabilities of all
$A_i$, $i = 2, \ldots, m$, under the two measures, coincide, and are independent of $Z^n$). For the other
states, both $\bar{P}'_w$ and $\tilde{P}'_w$ use ML estimates (which are constrained for the latter distribution). The
key observation is then that

$$-\log \hat{P}'_w(A_{\Pi'}(z^n)) - \hat{H}_{k_w}(A_{\Pi'}(z^n)) = -\log \tilde{P}'_w(A_{\Pi'}(z^n)) + \log \bar{P}'_w(A_{\Pi'}(z^n)). \qquad (38)$$

Now, the probability $P_{\mathrm{err}}(n)$ of the error event is given by

$$P_{\mathrm{err}}(n) \triangleq \sum_{z^n :\, C_{\Pi',\mathbf{k}'}(z^n) \le C_{\Pi,\mathbf{k}}(z^n)} P(z^n) = \sum_{Z^n \in \mathcal{E}} P'_w(Z^n) \qquad (39)$$

where $\mathcal{E}$ denotes the subset of switch sequences $Z^n$ over $(\Pi')^n$ satisfying

$$\hat{H}_{k_w}(Z^n) + \beta\, m(m+1)^{k_w} \log(n+1) \le -\log \hat{P}'_w(Z^n) + \beta\big[(m-1)m^{k_w} + 1\big] \log(n+1)$$

and the second equality in (39) follows from (36). By (38), $Z^n \in \mathcal{E}$ if and only if

$$-\log \tilde{P}'_w(Z^n) \ge -\log \bar{P}'_w(Z^n) + \beta\big[m(m+1)^{k_w} - (m-1)m^{k_w} - 1\big] \log(n+1)$$

or, equivalently,

$$\tilde{P}'_w(Z^n) \le (n+1)^{-\beta[m(m+1)^{k_w} - (m-1)m^{k_w} - 1]}\, \bar{P}'_w(Z^n).$$

Therefore, by the first inequality in (37), the rightmost summation in (39) can be upper-bounded
to obtain

$$P_{\mathrm{err}}(n) \le (n+1)^{-\beta[m(m+1)^{k_w} - (m-1)m^{k_w} - 1]} \sum_{Z^n \in (\Pi')^n} \bar{P}'_w(Z^n). \qquad (40)$$

Notice that the probability distributions in the summation in the right-hand side of (40) depend
on $Z^n$. Clearly, when restricted to sequences $Z^n$ giving rise to the same distribution, the partial
sum is upper-bounded by 1. Therefore, the overall sum is upper-bounded by the number $N$ of
distinct such distributions. Now, there are $(m+1)^{k_w} - (m-1)^{k_w}$ states given by $k_w$-tuples
containing either $B_1$ or $B_2$ and, by the definition of $\bar{P}'_w$, for each of these states there are at most
$(n+1)^{m+1}$ possible conditional distributions, given by the composition of the corresponding
substring in $Z^n$. For each of the remaining $(m-1)^{k_w}$ states, the definition of $\bar{P}'_w$ implies that
there are at most $(n+1)^2$ possible conditional distributions. Therefore,

$$N \le (n+1)^{2(m-1)^{k_w} + [(m+1)^{k_w} - (m-1)^{k_w}](m+1)}$$

implying

$$P_{\mathrm{err}}(n) \le (n+1)^{2(m-1)^{k_w} + [(m+1)^{k_w} - (m-1)^{k_w}](m+1) - \beta[m(m+1)^{k_w} - (m-1)m^{k_w} - 1]}. \qquad (41)$$

Since $m \ge 2$ and $k_w \ge 1$, it can be readily shown that, for any $\beta > 3$, the exponent in the
right-hand side of (41) is less than $-1$. Thus, $P_{\mathrm{err}}(n)$ is summable and the result follows from
the Borel-Cantelli lemma. ■
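The claim about the exponent in (41) is easy to probe numerically. The following sketch, with hypothetical function names, evaluates the two switch model sizes appearing in (36) and the exponent of (41) over a small grid of $m$ and $k_w$, using exact rational arithmetic; it is a spot check over the chosen grid, not a proof.

```python
from fractions import Fraction

def switch_model_sizes(m, k_w):
    """Free-parameter counts of the two candidate switch models in (36)."""
    refined = m * (m + 1) ** k_w          # plain Markov switch of order k_w over Pi'
    merged = (m - 1) * m ** k_w + 1       # merged states plus the ratio parameter gamma
    return refined, merged

def exponent_41(m, k_w, beta):
    """Exponent of (n + 1) on the right-hand side of (41)."""
    refined, merged = switch_model_sizes(m, k_w)
    type_count = 2 * (m - 1) ** k_w + ((m + 1) ** k_w - (m - 1) ** k_w) * (m + 1)
    return type_count - beta * (refined - merged)

beta = Fraction(301, 100)  # assumed value slightly above 3
worst = max(exponent_41(m, k_w, beta)
            for m in range(2, 7) for k_w in range(1, 5))
print(worst)                  # largest exponent over the grid
print(exponent_41(2, 1, 3))   # boundary case beta = 3
```

On this grid the largest exponent occurs at $m = 2$, $k_w = 1$, where $\beta = 3$ gives exactly $-1$, which suggests why the strict condition $\beta > 3$ is needed. For $k_w = 0$ the two model sizes coincide, in agreement with the vanishing penalty difference behind (33).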
With these tools in hand, we are now ready to prove Theorem 4.

Proof of Theorem 4: Define the set

$$\boldsymbol{\Pi}' = \{ (\Pi', \mathbf{k}') \mid \Pi' \text{ is incompatible with } P \}.$$

To establish the first claim of the theorem, we will prove that $(\hat{\Pi}(z^n), \hat{\mathbf{k}}(z^n)) \notin \boldsymbol{\Pi}'$ a.s. as
$n \to \infty$. Consider a partition $\bar{\Pi}$ compatible with $P$, denote by $\bar{\mathbf{k}}$ the associated order vector,
and let $\bar{\kappa} = \kappa(\bar{\Pi}, \bar{\mathbf{k}})$. Let $\kappa_0 > \bar{\kappa}$ denote a threshold for model sizes, which is independent of $n$,
and will be specified in more detail later on. Write $\boldsymbol{\Pi}' = \boldsymbol{\Pi}_1 \cup \boldsymbol{\Pi}_2$, where

$$\boldsymbol{\Pi}_1 = \{ (\Pi', \mathbf{k}') \in \boldsymbol{\Pi}' \mid \kappa(\Pi', \mathbf{k}') < \kappa_0 \},$$

and $\boldsymbol{\Pi}_2 = \boldsymbol{\Pi}' \setminus \boldsymbol{\Pi}_1$. Clearly, $\boldsymbol{\Pi}_1$ is finite and its size is independent of $n$. By Lemma 10, for
each pair $(\Pi', \mathbf{k}') \in \boldsymbol{\Pi}_1$, we have $C_{\Pi',\mathbf{k}'}(z^n) > C_{\bar{\Pi},\bar{\mathbf{k}}}(z^n)$ a.s. as $n \to \infty$, for any penalization
coefficient $\beta > 0$. Thus, the search in (12), almost surely, will not return a pair from $\boldsymbol{\Pi}_1$. It
remains to prove that it will not return a pair from $\boldsymbol{\Pi}_2$ either. As mentioned, the difficulty here
is that the size of $\boldsymbol{\Pi}_2$ (and of the IMP models associated with pairs in $\boldsymbol{\Pi}_2$) is not bounded
as $n \to \infty$, and we cannot establish the desired result with a finite number of applications of
Lemma 10. As before, we adapt some tools from [11] to IMP-constrained FSMSs.

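For $(\Pi', \mathbf{k}') \in \boldsymbol{\Pi}_2$, let $\mathcal{P}_{\Pi',\mathbf{k}'}$ denote the probability that a solution with $(\Pi', \mathbf{k}')$ is preferred
over $(\bar{\Pi}, \bar{\mathbf{k}})$ in the minimization. Define

$$B_{\Pi',\mathbf{k}'} = \{ z^n \mid C_{\Pi',\mathbf{k}'}(z^n) \le C_{\bar{\Pi},\bar{\mathbf{k}}}(z^n) \}.$$

Clearly, we have

$$\mathcal{P}_{\Pi',\mathbf{k}'} \le \sum_{z^n \in B_{\Pi',\mathbf{k}'}} P(z^n). \qquad (42)$$

By the definitions of $B_{\Pi',\mathbf{k}'}$ and of the cost function in (11), and denoting $\kappa' = \kappa(\Pi', \mathbf{k}')$, we
have, for $z^n \in B_{\Pi',\mathbf{k}'}$,

$$\hat{H}_{\bar{\Pi},\bar{\mathbf{k}}}(z^n) \ge \hat{H}_{\Pi',\mathbf{k}'}(z^n) + \beta(\kappa' - \bar{\kappa}) \log(n+1). \qquad (43)$$

Recalling that $P(z^n) \le \hat{P}_{\bar{\Pi},\bar{\mathbf{k}}}(z^n)$ by (10), and that $\hat{H}_{\Pi',\mathbf{k}'}(z^n) = -\log \hat{P}_{\Pi',\mathbf{k}'}(z^n)$, it follows
from (43) that

$$P(z^n) \le (n+1)^{\beta(\bar{\kappa} - \kappa')}\, \hat{P}_{\Pi',\mathbf{k}'}(z^n), \quad z^n \in B_{\Pi',\mathbf{k}'},$$

and, hence, together with (42), and applying an obvious bound, we obtain

$$\mathcal{P}_{\Pi',\mathbf{k}'} \le (n+1)^{\beta(\bar{\kappa} - \kappa')} \sum_{z^n \in B_{\Pi',\mathbf{k}'}} \hat{P}_{\Pi',\mathbf{k}'}(z^n). \qquad (44)$$

In analogy to the reasoning following (40) in the proof of Lemma 11, the summation on the
right-hand side of (44) can be upper-bounded by the number of different empirical distributions (or
types) for IMPs based on $F_{\Pi',\mathbf{k}'}$ and sequences of length $n$. It is well established (see, e.g., [18])
that $\kappa_i$ counts suffice to determine the empirical distribution for the Markov component
$P_i$ (and similarly for the switch $P_w$). Hence, recalling (5), we conclude that $\kappa' = \kappa(\Pi', \mathbf{k}')$
counts suffice to determine an empirical distribution $\hat{P}_{\Pi',\mathbf{k}'}(z^n)$, and, therefore, the number of
such distributions is upper-bounded (quite loosely) by $(n+1)^{\kappa'}$. Thus, it follows from (44) that

$$\mathcal{P}_{\Pi',\mathbf{k}'} \le (n+1)^{\beta(\bar{\kappa} - \kappa') + \kappa'}. \qquad (45)$$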

We next bound the number of pairs $(\Pi', \mathbf{k}')$ satisfying $\kappa(\Pi', \mathbf{k}') = \kappa'$ for a given $\kappa' \ge \kappa_0$. The
number of partitions $\Pi'$ is upper-bounded by $\alpha^{\alpha}$, where $\alpha = |A|$. For a given partition, with,
say $|\Pi'| = m$, we need an assignment of process orders $k'_i$, $i \in \{1, 2, \ldots, m, w\}$. If $|A'_i| = 1$,
the only valid assignment is $k'_i = 0$, while if $|A'_i| \ge 2$, we must have $k'_i \le \log \kappa'$. Thus, since
$m \le \alpha$, the number of pairs sought is upper-bounded by $\alpha^{\alpha} (\log \kappa')^{\alpha+1}$. We notice also that, for
$z^n \in B_{\Pi',\mathbf{k}'}$ and sufficiently large $n$, we must have $\kappa' < n$ (actually, $\kappa' = o(n)$), for otherwise
the penalty component of $C_{\Pi',\mathbf{k}'}(z^n)$ on its own would surpass $C_{\bar{\Pi},\bar{\mathbf{k}}}(z^n)$, which is $O(n)$. Hence,
for sufficiently large $n$, denoting by $P_{\mathrm{err}}(n)$ the probability of a pair from $\boldsymbol{\Pi}_2$ prevailing over
$(\bar{\Pi}, \bar{\mathbf{k}})$ in (12), and observing that $\alpha^{\alpha} (\log(n+1))^{\alpha+1} \le (n+1)^{\alpha \log \alpha + \alpha + 1}$ for $n \ge 1$, it follows
from (45) that

$$P_{\mathrm{err}}(n) \le \sum_{(\Pi',\mathbf{k}') :\, \kappa' \ge \kappa_0} \mathcal{P}_{\Pi',\mathbf{k}'} \le \sum_{\kappa' = \kappa_0}^{n} (n+1)^{\kappa'(1-\beta) + \beta\bar{\kappa} + \alpha \log \alpha + \alpha + 1} \le (n+1)^{\kappa_0(1-\beta) + \beta\bar{\kappa} + \alpha \log \alpha + \alpha + 2},$$

where the last inequality holds for $\beta > 1$. Choosing $\kappa_0 > \dfrac{\beta\bar{\kappa} + \alpha \log \alpha + \alpha + 3}{\beta - 1}$, we get

$$P_{\mathrm{err}}(n) \le (n+1)^{\delta}$$

for a constant $\delta < -1$. Therefore, $P_{\mathrm{err}}(n)$ is summable, and, applying again the Borel-Cantelli
lemma, $(\hat{\Pi}, \hat{\mathbf{k}}) \notin \boldsymbol{\Pi}_2$ a.s. as $n \to \infty$. We conclude that $\hat{\Pi}$ is compatible with $P$ a.s. as
$n \to \infty$, as claimed. The fact that $\hat{\mathbf{k}}$ is, almost surely, the correct order vector follows from
the well known consistency of penalized ML estimators for Markov order [6] (recall, from the
discussion following (12), that the order of each subprocess is estimated independently).

The second claim of the theorem is proved by applying Lemma 11, which implies that in
the domination-free case, the canonical partition beats other compatible partitions with more
subalphabets. When $k_w > 0$, this follows from (32), while when $k_w = 0$, it follows from (33) and
our tie-breaking convention. ■

Acknowledgment. Thanks to Erik Ordentlich and Farzad Parvaresh for stimulating discussions. 